CS7545_Sp23_Lecture_03: Convex Analysis
CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Jacob Abernethy
Notes for Lecture 03 - Convex Analysis
January 17, 2023
Scribes: Julia Grigni, Salil Kamath
Announcements
Scribes should ensure that a draft of the lecture notes is available within 72 hours of the lecture.
If you find inconsistencies between the markdown and TeX versions while scribing, note them on the appropriate wiki page.
The teaching team is currently deciding whether or not to accept scanned, handwritten homework solutions; if you feel strongly about this, make a post on Piazza.
Derivatives
Definition 3.1 (Gradient)
Let $X\subseteq \mathbb{R}^n$, $f:X\rightarrow \mathbb{R}$. Assume $f$ is twice differentiable.
The gradient of $f$ at $x$, written $\nabla f(x)$, is defined as the vector of partial derivatives: $$\nabla f(x) = \left [ \frac{\partial f}{\partial x_i} \right ]_{i=1,...,n}$$
Definition 3.2 (Hessian)
The Hessian of $f$ at $x$ is the matrix
$$\nabla ^2 f(x) = \left [ \frac{\partial^2 f}{\partial x_i \, \partial x_j}\right ]_{1\leq i \leq n;\ 1\leq j \leq n}$$ Note: $\nabla ^2 f(x)$ is always symmetric (for $f$ with continuous second partials, by Schwarz's theorem).
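To make these definitions concrete, here is a minimal numerical sketch of ours (not from the lecture) that approximates the gradient and Hessian by central finite differences; the test function and step sizes are arbitrary choices:

```python
import numpy as np

def grad(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Finite-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(f, x + e, h) - grad(f, x - e, h)) / (2 * h)
    return H

f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]  # example function
x = np.array([1.0, 2.0])
print(grad(f, x))     # analytic gradient: [2*x_1 + 3*x_2, 3*x_1] = [8, 3]
print(hessian(f, x))  # analytic Hessian: [[2, 3], [3, 0]] -- symmetric
```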
Theorem 3.1 (Mean Value Theorem)
For differentiable $g$ and all $x,y\in domain(g)$, $$g(x)-g(y)=\langle \nabla g(z), x-y \rangle$$ for some $z$ on the segment between $x$ and $y$, i.e. $z = tx + (1-t)y$ for some $t \in [0,1]$.
Lipschitz
Definition 3.3 (c-Lipschitz)
We say a function $f$ is $c$-Lipschitz $\left ( c\geq0 \right )$ with respect to a norm $\lVert\cdot\rVert$ if
$\forall x,y \in domain(f)$, $$|f(x)-f(y)| \leq c\lVert x-y\rVert$$
Intuition: $f$ is $c$-Lipschitz if it doesn't change too fast: it moves by at most $c$ per unit of distance (for differentiable $f$, this means the gradient is bounded by $c$; see Theorem 3.2).
Examples:
$\sigma(x)=\frac{1}{1+e^{-x}}$ is Lipschitz: its derivative $\sigma(x)(1-\sigma(x))$ is bounded by $1/4$.
$\log(x)$ is not Lipschitz on $(0,\infty)$: its derivative $1/x$ is unbounded as $x \to 0^+$ (both examples are checked numerically below).
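Both examples can be checked numerically (our own sketch, not from the lecture): $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ never exceeds $1/4$, while $\frac{d}{dx}\log(x) = 1/x$ is unbounded near $0$:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x)(1 - sigma(x)) peaks at x = 0 with value 1/4,
# so sigmoid is (1/4)-Lipschitz on all of R.
xs = np.linspace(-20, 20, 10001)
print(np.max(sigmoid(xs) * (1 - sigmoid(xs))))  # ~0.25

# d/dx log(x) = 1/x blows up as x -> 0+, so no constant c can work:
xs = np.array([1.0, 1e-2, 1e-4, 1e-8])
print(1 / xs)  # slopes grow without bound near 0
```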
Theorem 3.2
If $f$ is differentiable, then $f$ is $c$-Lipschitz iff $\forall x\in domain(f)$, $$\lVert\nabla f(x)\rVert_* \leq c$$
Note: The bound is on the dual norm $\lVert\cdot\rVert_*$ of the norm $\lVert\cdot\rVert$ appearing in the Lipschitz condition.
Proof (of the direction: gradient bound $\Rightarrow$ $c$-Lipschitz): Assume $\forall x \in domain(f)$, $\lVert\nabla f(x)\rVert_* \leq c$. Then
$$\begin{align*}f(x)-f(y)&=\nabla f(z)^\top (x-y) && \text{for some $z$ by the Mean Value Theorem} \\
&\leq \lVert\nabla f(z)\rVert_* \lVert x-y\rVert && \text{by Hölder's inequality}\\
&\leq c\lVert x-y\rVert && \text{by the assumption.}
\end{align*}$$
By definition, $f$ is $c$-Lipschitz. $\blacksquare$
Convexity
Definition 3.4 (Convex Set)
A set $\mathcal{U}\subseteq \mathbb{R}^n$ is convex if $\forall x,y \in \mathcal{U}$ and $\forall t \in [0,1]$, $tx+(1-t)y \in \mathcal{U}$. (The whole segment between $x$ and $y$ must be in the set; for closed sets it suffices to check the midpoint $\frac{x+y}{2}$.)
Definition 3.5 (Convex Function)
A function $f$ is convex if $domain(f)$ is convex and $\forall x,y \in domain(f)$, $\forall t \in [0,1]$, $$f(tx+(1-t)y)\leq t f(x)+(1-t)f(y)$$
That is, the chord (linear interpolation) between any two points on the graph has to lie above the function. (For continuous $f$, it suffices to check the midpoint case $t=\frac{1}{2}$.)
Theorem 3.4
Let $f$ be differentiable. Then $f$ is convex iff $\forall x,y \in domain(f)$, $$f(y)\geq f(x) + \langle \nabla f(x), y-x\rangle$$
That is, the linear approximation is a lower bound.
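As a quick sanity check of Theorem 3.4 (our own sketch; the dimension and sample count are arbitrary choices), the following verifies the first-order lower bound for the convex function $f(x) = \lVert x \rVert_2^2$, whose gradient is $2x$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x @ x       # f(x) = ||x||_2^2, convex
grad_f = lambda x: 2 * x  # its gradient

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    # first-order condition: f(y) >= f(x) + <grad f(x), y - x>
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
print("first-order lower bound held at all sampled pairs")
```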
Theorem 3.5 (Jensen's Inequality)
Jensen's Inequality states that $f$ is convex if and only if, for all random variables $X$ taking values in $domain(f)$, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$.
This is an equivalent, alternate characterization of convexity; it is fine to think of it as the true definition of convex functions if that makes more sense to you.
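A quick Monte Carlo illustration (our own sketch, not from the lecture; the distribution and convex function are arbitrary choices): with $f(t)=t^2$ and $X$ exponential with mean $2$, we have $f(\mathbb{E}[X]) = 4$ while $\mathbb{E}[f(X)] = Var(X) + \mathbb{E}[X]^2 = 8$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=1_000_000)  # E[X] = 2, Var(X) = 4
f = lambda t: t ** 2                            # a convex function

print(f(X.mean()))   # f(E[X])  ~= 4
print(f(X).mean())   # E[f(X)] ~= 8 >= f(E[X]), as Jensen predicts
```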
Theorem 3.6 (Second Order Condition)
Let $f$ be twice-differentiable. Then, $f$ is convex if and only if, $\forall x \in domain(f)$,
$$\nabla ^2 f(x) \succeq 0 $$
The intuition here is that convexity means the "curvature" is non-negative in every direction, which for twice-differentiable $f$ is exactly the statement that the Hessian is positive semi-definite (PSD).
Example:
Let $\vec{x}$ be a 2-dimensional vector, and let $f(\vec{x}) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1 x_2$.
Its Hessian is constant: $$\nabla^2 f(x) = \begin{bmatrix} 2 & -2 \\
-2 & 2 \end{bmatrix}$$
Next, we want to find the eigenvalues of $\nabla^2 f(x)$. Since $\nabla^2 f(x)$ is a $2 \times 2$ matrix with rank 1, one of the eigenvalues is 0. Furthermore, since $$tr(A) = \sum_{i=1}^{n}\lambda_i$$ the other eigenvalue must be $tr(\nabla^2 f(x)) = 4$.
Since we have $\lambda_1 = 4$, $\lambda_2=0$, that means $\nabla^2 f(x)$ is PSD. Therefore, $f$ is convex.
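For a quick numerical confirmation (illustrative; not part of the lecture), `numpy.linalg.eigvalsh` computes the eigenvalues of the symmetric Hessian directly:

```python
import numpy as np

H = np.array([[2.0, -2.0],
              [-2.0, 2.0]])   # Hessian of f(x) = (x_1 - x_2)^2
print(np.linalg.eigvalsh(H))  # [0. 4.] -- all nonnegative, so H is PSD
```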
Exercise (Counterintuitive PSD):
Let $M$ be a PSD matrix. Is it possible that $M_{ii} < 0$?
No, since we would have $M_{ii} = \vec{e}_i^T M \vec{e}_i < 0$, which means that $M$ is not PSD.
Can a matrix with only positive entries have a negative eigenvalue?
Yes, consider the following example:
$$\begin{bmatrix} 1 & 2 \\
2 & 1\end{bmatrix}$$
This matrix has eigenvalues $\lambda_1=3$, $\lambda_2 = -1$: the characteristic polynomial is $\det \begin{bmatrix} 1-\lambda & 2 \\ 2 & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - 4 = 0$, so $\lambda = 1 \pm 2$.
Moral of the story: PSD matrices are not intuitive, and you need to prove everything.
Note: There are at least 2 ways to check whether a matrix is PSD:
Compute all of the eigenvalues and check that $\lambda_i \geq 0$ for all $i \in \{1, \ldots, n\}$.
Sylvester's Criterion: Let $M$ be an $n \times n$ Hermitian matrix. Then, $M$ is positive-semidefinite if and only if all the principal minors are nonnegative. (A brute-force implementation is sketched after this list.)
Real symmetric matrices are all Hermitian.
A principal minor is the determinant of the square submatrix obtained by deleting some set of rows and the columns with the same indices from $M$ (equivalently, keeping the rows and columns indexed by some nonempty $S \subseteq \{1, \ldots, n\}$).
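Here is a brute-force sketch of the second method (our own illustration; the function name and tolerance are arbitrary choices). It enumerates all $2^n - 1$ principal minors, so it is only practical for small matrices:

```python
import numpy as np
from itertools import combinations

def is_psd_sylvester(M, tol=1e-10):
    """Check PSD-ness of a real symmetric matrix via Sylvester's criterion."""
    n = M.shape[0]
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            # principal minor: determinant of the submatrix keeping
            # the rows and columns indexed by S
            if np.linalg.det(M[np.ix_(S, S)]) < -tol:
                return False
    return True

print(is_psd_sylvester(np.array([[2.0, -2.0], [-2.0, 2.0]])))  # True
print(is_psd_sylvester(np.array([[1.0, 2.0], [2.0, 1.0]])))    # False: minor 1*1 - 2*2 = -3
```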
Strong Convexity and Smoothness
Definition 3.6 ($\mu$-strongly convex)
A differentiable function $f$ is $\mu$-strongly convex with respect to norm $\lVert \cdot \rVert$ if, $\forall x, y \in domain(f)$,
$$f(x) \geq f(y) + \langle\nabla f(y), x-y\rangle + \frac{\mu}{2} \lVert y-x \rVert ^2 $$
Intuition behind this definition and convexity
With convexity, the function cannot curve down, it can only curve up.
With strong convexity, the function must curve up by at least a quadratic amount, at rate $\mu$ (see the sketch below).
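To see the difference concretely, here is a small sketch of ours (not from the lecture): $f(x) = x^4$ is convex, but it is not $\mu$-strongly convex for any $\mu > 0$, because its curvature vanishes at $0$:

```python
# Strong convexity requires f(x) >= f(y) + f'(y)(x - y) + (mu/2)(x - y)^2.
# For f(x) = x^4 (so f'(y) = 4y^3), expand around the flat point y = 0:
mu = 0.1                      # any positive mu fails for x close enough to 0
y, x = 0.0, 0.01
lhs = x ** 4                                                   # 1e-8
rhs = y ** 4 + 4 * y ** 3 * (x - y) + (mu / 2) * (x - y) ** 2  # 5e-6
print(lhs >= rhs)             # False: the inequality is violated near 0
```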
Definition 3.7 ($\mu$-smooth)
A differentiable function $f$ is $\mu$-smooth with respect to norm $\lVert \cdot \rVert$ if, $\forall x, y \in domain(f)$,
$$f(x) \leq f(y) + \langle\nabla f(y), x - y\rangle + \frac{\mu}{2} \lVert y-x \rVert ^2 $$
Lemma 3.1 (Shifting eigenvalues)
Let $\alpha \in \mathbb{R}$, and let $M$ be a symmetric matrix. Denote the eigenvalues of a matrix $M$ as $eigs(M)$, in the form of a row vector. Then, $$eigs(M + \alpha I) = eigs(M) + \alpha(1, \ldots, 1)$$ That is, adding $\alpha I$ shifts every eigenvalue of $M$ by $\alpha$.
Note ($\mu$-strong convexity, $\mu$-smoothness, and twice differentiability):
If $f$ is twice differentiable, then
$f$ is $\mu$-strongly convex w.r.t. the 2-norm if and only if, $\forall x \in domain(f)$, $\nabla ^2 f(x) \succeq \mu I $, or equivalently, $\nabla ^2 f(x) - \mu I \succeq 0 $
By Lemma 3.1, this is equivalent to the statement that all the eigenvalues of $\nabla ^2 f(x)$ are at least $\mu$.
$f$ is $\mu$-smooth w.r.t. the 2-norm if and only if, $\forall x \in domain(f)$, $\nabla ^2 f(x) \preceq \mu I $
Note: Smoothness is independent of convexity; a function can be neither, either, or both. For example, $f(x)= \frac{1}{2} \lVert x \rVert_2 ^2$ is both $1$-smooth and $1$-strongly convex, since its Hessian is $I$ everywhere.
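As a closing illustration (our own sketch; the matrix `A` is an arbitrary choice): for a quadratic $f(x) = \frac{1}{2}x^\top A x$ with symmetric $A$, the Hessian is $A$ everywhere, so by the note above $f$ is $\mu$-strongly convex and $L$-smooth w.r.t. the 2-norm with $\mu = \lambda_{min}(A)$ and $L = \lambda_{max}(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + np.eye(4)        # symmetric positive definite
# Hessian of f(x) = (1/2) x^T A x is A at every point x.

eigs = np.linalg.eigvalsh(A)   # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]
print(f"f is {mu:.3f}-strongly convex and {L:.3f}-smooth w.r.t. the 2-norm")
```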