CS7545_Sp23_Lecture_03: Convex Analysis

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Jacob Abernethy

Notes for Lecture 03 - Convex Analysis

January 17, 2023

Scribes: Julia Grigni, Salil Kamath

Announcements

  • Scribes should ensure that a draft of the lecture notes is available within 72 hours of the lecture.
  • If you find inconsistencies between the markdown and TeX versions while scribing, note them on the appropriate wiki page.
  • The teaching team is currently deciding whether or not to accept scanned, handwritten homework solutions; if you feel strongly about this, make a post on Piazza.

Derivatives

Definition 3.1 (Gradient)
Let $X\subseteq \mathbb{R}^n$ and $f:X\rightarrow \mathbb{R}$, and assume $f$ is twice differentiable. The gradient of $f$ at $x$, written $\nabla f(x)$, is the vector of partial derivatives.
$$\nabla f(x) = \left [ \frac{\partial f}{\partial x_i} \right ]_{i=1,...,n}$$

Definition 3.2 (Hessian)
The Hessian of $f$ at $x$ is the matrix $$\nabla ^2 f(x) = \left [ \frac{\partial^2 f}{\partial x_i \partial x_j}\right ]_{1\leq i, j \leq n}$$
Note: $\nabla ^2 f(x)$ is symmetric whenever the second partial derivatives are continuous.

Theorem 3.1 (Mean Value Theorem)
For differentiable $g$ and $\forall x,y\in domain(g)$, $$g(x)-g(y)=\langle \nabla g(z), x-y \rangle$$ for some $z$ on the segment between $x$ and $y$.
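For example (my addition, as a quick one-dimensional sanity check), take $g(x) = x^2$: $$g(x) - g(y) = x^2 - y^2 = (x+y)(x-y) = g'(z)(x-y) \quad \text{with } z = \frac{x+y}{2},$$ and $z$ indeed lies between $x$ and $y$.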

Lipschitz

Definition 3.3 (c-Lipschitz)
We say a function $f$ is $c$-Lipschitz $\left ( c\geq0 \right )$ with respect to a norm $\lVert\cdot\rVert$ if $\forall x,y \in domain(f)$ $$f(x)-f(y) \leq c\lVert x-y\rVert$$

  • Intuition: $f$ is $c$-Lipschitz if it does not change too fast; the function changes by at most $c$ per unit of distance.

Examples:

  1. The sigmoid $\sigma(x)=\frac{1}{1+e^{-x}}$ is Lipschitz (in fact $\frac{1}{4}$-Lipschitz, since $|\sigma'(x)| \leq \frac{1}{4}$).
  2. $\log(x)$ is not Lipschitz on $(0,\infty)$, since its derivative $\frac{1}{x}$ is unbounded as $x \to 0^+$ (see the numerical check below).
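The following numerical sketch (my addition; it assumes NumPy and uses randomly sampled points) illustrates both examples: the sigmoid's difference quotients stay below $\frac{1}{4}$, while those of $\log$ blow up as its arguments approach $0$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100_000)
y = rng.uniform(-10, 10, size=100_000)

# Difference quotients |f(x) - f(y)| / |x - y| lower-bound the best Lipschitz constant.
sig_quot = np.abs(sigmoid(x) - sigmoid(y)) / np.abs(x - y)
print("sigmoid: max quotient =", sig_quot.max())   # stays below 1/4

# log on (0, 1]: quotients grow without bound as the points approach 0.
u = rng.uniform(1e-8, 1.0, size=100_000)
v = rng.uniform(1e-8, 1.0, size=100_000)
log_quot = np.abs(np.log(u) - np.log(v)) / np.abs(u - v)
print("log: max quotient =", log_quot.max())       # huge -> not Lipschitz
```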

Theorem 3.2
If $f$ is differentiable, then $f$ is $c$-Lipschitz with respect to $\lVert\cdot\rVert$ iff $\forall x\in domain(f)$ $$\lVert\nabla f(x)\rVert_* \leq c$$

Note: the gradient is measured in the dual norm $\lVert\cdot\rVert_*$ of the norm $\lVert\cdot\rVert$ used in the Lipschitz condition.

Proof (one direction): Assume $\forall x \in domain(f)$, $\lVert\nabla f(x)\rVert_* \leq c$. Then for any $x, y \in domain(f)$,

$$\begin{align*}f(x)-f(y)&=\nabla f(z)^\top (x-y) && \text{for some $z$ by the Mean Value Theorem} \\ &\leq \lVert\nabla f(z)\rVert_* \lVert x-y\rVert && \text{by Hölder's inequality}\\ &\leq c\lVert x-y\rVert && \text{by the assumption,} \end{align*}$$

so $f$ is $c$-Lipschitz by definition.
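As an illustration (my addition, not from the lecture), the dual-norm bound is tight for linear functions: if $f(x) = \langle a, x\rangle$, then $\nabla f(x) = a$ for every $x$, so Theorem 3.2 says $f$ is $\lVert a \rVert_*$-Lipschitz with respect to $\lVert \cdot \rVert$. Indeed, $f(x) - f(y) = \langle a, x-y\rangle \leq \lVert a\rVert_* \lVert x-y\rVert$ by Hölder's inequality, and equality is attained for a suitable choice of $x-y$, so no smaller Lipschitz constant works.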

Convexity

Definition 3.4 (Convex Set)
A set $\mathcal{U}\subseteq \mathbb{R}^n$ is convex if $\forall x,y \in \mathcal{U}$, $\frac{x+y}{2}\in \mathcal{U}$. (The midpoint of any two points in the set must also lie in the set.)
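For example (my addition), any norm ball $\mathcal{B}_r = \{ x \in \mathbb{R}^n : \lVert x \rVert \leq r \}$ is convex: for $x, y \in \mathcal{B}_r$, the triangle inequality gives $$\left\lVert \frac{x+y}{2} \right\rVert \leq \frac{\lVert x \rVert + \lVert y \rVert}{2} \leq r,$$ so the midpoint lies in $\mathcal{B}_r$.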

Definition 3.5 (Convex Function)
A function $f$ is convex if $domain(f)$ is convex and $\forall x,y \in domain(f)$ $$f\left(\frac{x+y}{2}\right)\leq \frac{f(x)+f(y)}{2}$$ That is, the chord connecting any two points on the graph must lie on or above the function.

Theorem 3.3
$f$ is convex iff $\forall x,y \in domain(f)$, $\alpha \in [0,1]$,
$$f(\alpha x +(1-\alpha) y) \leq \alpha f(x) + (1-\alpha) f(y)$$

Exercise: Prove Theorem 3.3.

Theorem 3.4
Let $f$ be differentiable. Then $f$ is convex iff $\forall x,y \in domain(f)$ $$f(y)\geq f(x) + \langle \nabla f(x), y-x\rangle$$ That is, the first-order (linear) approximation at any point is a global lower bound on $f$.
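A minimal numerical sketch of Theorem 3.4 (my addition, assuming NumPy), using the convex function $f(x) = \lVert x \rVert_2^2$ with $\nabla f(x) = 2x$: the linear approximation at $x$ never exceeds $f(y)$.

```python
import numpy as np

def f(x):
    return np.dot(x, x)        # f(x) = ||x||_2^2, a convex function

def grad_f(x):
    return 2.0 * x             # gradient of f

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal(size=5)
    lower_bound = f(x) + grad_f(x) @ (y - x)   # linear approximation of f at x, evaluated at y
    assert f(y) >= lower_bound - 1e-9          # Theorem 3.4: f(y) >= f(x) + <grad f(x), y - x>
print("first-order condition verified on 1000 random pairs")
```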

Theorem 3.5 (Jensen's Inequality)
Jensen's Inequality states that $f$ is convex if and only if, for all random variables $X$ taking values in $domain(f)$, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. (A worked example appears after the remarks below.)

  • This is an equivalent, alternate definition of convexity.
    • It is fine to think of this as the true definition of convex functions if that makes more sense to you.
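A standard consequence (my addition): applying Jensen's Inequality to the convex function $f(t) = t^2$ gives $$(\mathbb{E}[X])^2 = f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] = \mathbb{E}[X^2],$$ which is exactly the statement that $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \geq 0$.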

Theorem 3.6 (Second Order Condition)
Let $f$ be twice-differentiable. Then, $f$ is convex if and only if, $\forall x \in domain(f)$, $$\nabla ^2 f(x) \succeq 0 $$

  • The intuition here is that convexity corresponds to non-negative "curvature" in every direction, which is exactly the requirement that the Hessian be positive semi-definite (PSD).

Example:
Let $\vec{x}$ be a 2-dimensional vector, and let $f(\vec{x}) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1 x_2$.

$$\begin{align*} \nabla f(x) = \begin{bmatrix} 2x_1 - 2x_2 & 2x_2 - 2x_1 \end{bmatrix} \end{align*}$$

$$\begin{align*} \nabla^2 f(x) &= \begin{bmatrix} 2 & -2 \\ -2 & 2 \end{bmatrix}\\ \end{align*}$$

Next, we want to find the eigenvalues of $\nabla^2 f(x)$. Since $\nabla^2 f(x)$ is a $2 \times 2$ matrix with rank 1, one of the eigenvalues is 0. Furthermore, since $$\operatorname{tr}(A) = \sum_{i=1}^{n}\lambda_i$$ for any square matrix $A$, and the trace here is $4$, the other eigenvalue must be 4.

Since we have $\lambda_1 = 4$, $\lambda_2=0$, that means $\nabla^2 f(x)$ is PSD. Therefore, $f$ is convex.
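A quick numerical confirmation of this example (my addition, assuming NumPy):

```python
import numpy as np

# Hessian of f(x) = (x_1 - x_2)^2; it is the same matrix at every x.
H = np.array([[ 2.0, -2.0],
              [-2.0,  2.0]])

eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix, in ascending order
print(eigvals)                     # [0. 4.]
print(bool(np.all(eigvals >= 0)))  # True -> H is PSD, so f is convex
```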

Exercise (Counterintuitive PSD):

  1. Let $M$ be a PSD matrix. Is it possible that $M_{ii} < 0$?
    No: if $M_{ii} < 0$, then $\vec{e}_i^\top M \vec{e}_i = M_{ii} < 0$, which contradicts $M$ being PSD.

  2. Can a matrix with only positive entries have a negative eigenvalue?
    Yes, consider the following example:

$$\begin{bmatrix} 1 & 2 \\ 2 & 1\end{bmatrix}$$

This matrix has eigenvalues $\lambda_1=3$, $\lambda_2 = -1$, as found below:

$$\begin{align*} 0 &= \begin{vmatrix} 1 - \lambda & 2 \\ 2 & 1 - \lambda \end{vmatrix}\\ &= (1-\lambda)^2 - 4 \\ &= 1 - 2\lambda + \lambda^2 - 4 \\ &= \lambda^2 - 2\lambda - 3 \\ &= (\lambda + 1) (\lambda - 3) \text{, which yields } \lambda_1=3, \lambda_2 = -1 \end{align*}$$

Moral of the story: PSD matrices are not intuitive, and you need to prove everything.

Note: There are at least two ways to check whether a matrix is PSD (a sketch of both checks follows this list):

  1. Compute all of the eigenvalues and check that $\lambda_i \geq 0$ for all $i \in \{1, \dots, n\}$.
  2. Sylvester's Criterion: Let $M$ be an $n \times n$ Hermitian matrix. Then, $M$ is positive semi-definite if and only if all of its principal minors are nonnegative.
    • Real symmetric matrices are all Hermitian.
    • A principal minor of $M$ is the determinant of the square submatrix obtained by keeping the rows and columns indexed by some subset $S \subseteq \{1, \dots, n\}$ (equivalently, by deleting the same set of rows and columns).
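Here is a hedged sketch of both checks (my addition, assuming NumPy; the helper names `is_psd_eig` and `is_psd_minors` are my own), applied to the two matrices from this lecture:

```python
import itertools
import numpy as np

def is_psd_eig(M, tol=1e-10):
    """Check 1: all eigenvalues of the symmetric matrix M are nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def is_psd_minors(M, tol=1e-10):
    """Check 2 (Sylvester-type criterion for PSD): every principal minor,
    i.e. the determinant of the submatrix keeping rows and columns in the
    same index set S, is nonnegative."""
    n = M.shape[0]
    for k in range(1, n + 1):
        for S in itertools.combinations(range(n), k):
            idx = np.array(S)
            if np.linalg.det(M[np.ix_(idx, idx)]) < -tol:
                return False
    return True

H = np.array([[2.0, -2.0], [-2.0, 2.0]])   # Hessian from the convexity example
A = np.array([[1.0, 2.0], [2.0, 1.0]])     # positive entries, but has eigenvalue -1

for name, M in [("H", H), ("A", A)]:
    print(name, is_psd_eig(M), is_psd_minors(M))
# Expected output: H True True / A False False
```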

Strong Convexity and Smoothness

Definition 3.6 ($\mu$-strongly convex)
A differentiable function $f$ is $\mu$-strongly convex with respect to norm $\lVert \cdot \rVert$ if, $\forall x, y \in domain(f)$, $$f(x) \geq f(y) + \langle\nabla f(y), x-y\rangle + \frac{\mu}{2} \lVert y-x \rVert ^2 $$

  • Intuition, compared with plain convexity:
    • With convexity, the function cannot curve downward; it can only curve upward (or stay flat).
    • With strong convexity, the function must curve upward by at least a quadratic amount governed by $\mu$.

Definition 3.7 ($\mu$-smooth)
A differentiable function $f$ is $\mu$-smooth with respect to norm $\lVert \cdot \rVert$ if, $\forall x, y \in domain(f)$, $$f(x) \leq f(y) + \langle\nabla f(y), x - y\rangle + \frac{\mu}{2} \lVert y-x \rVert ^2 $$

Lemma 3.1 (Shifting eigenvalues)
Let $\alpha \in \mathbb{R}$, and let $M$ be a PSD matrix. Denote the eigenvalues of a matrix M as $eigs(M)$, in the form of a row vector. Then,

$$eigs(M-\alpha I) = eigs(M) - \begin{bmatrix} \alpha & \alpha & ... & \alpha \end{bmatrix} $$
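A quick numerical illustration of Lemma 3.1 (my addition, assuming NumPy), using the Hessian from the convexity example and $\alpha = 1$:

```python
import numpy as np

M = np.array([[ 2.0, -2.0],
              [-2.0,  2.0]])                         # PSD, eigenvalues {0, 4}
alpha = 1.0

original = np.linalg.eigvalsh(M)                     # [0. 4.]
shifted = np.linalg.eigvalsh(M - alpha * np.eye(2))  # [-1. 3.]
print(original, shifted)
print(bool(np.allclose(shifted, original - alpha)))  # True: eigenvalues shift by alpha
```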

Note ($\mu$-strong convexity, $\mu$-smoothness, and twice differentiability): If $f$ is twice differentiable, then

  1. $f$ is $\mu$-strongly convex w.r.t. the 2-norm if and only if, $\forall x \in domain(f)$, $\nabla ^2 f(x) \succeq \mu I $, or equivalently, $\nabla ^2 f(x) - \mu I \succeq 0 $
    • By Lemma 3.1, this is equivalent to the statement that all the eigenvalues of $\nabla ^2 f(x)$ are at least $\mu$.
  2. $f$ is $\mu$-smooth w.r.t. the 2-norm if and only if, $\forall x \in domain(f)$, $\nabla ^2 f(x) \preceq \mu I $

Note: Smoothness does not depend on convexity; a function can be smooth, strongly convex, both, or neither. For example, $f(x)= \frac{1}{2} \lVert x \rVert ^2$ is both $1$-smooth and $1$-strongly convex.
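A short sketch of this example (my addition, assuming NumPy): for $f(x) = \frac{1}{2}\lVert x\rVert^2$ the Hessian is the identity, so both conditions in the note above hold with $\mu = 1$; the code checks the two defining inequalities directly on random points, where they in fact hold with equality.

```python
import numpy as np

def f(x):
    return 0.5 * np.dot(x, x)   # f(x) = (1/2) ||x||_2^2

def grad_f(x):
    return x                    # gradient of f

rng = np.random.default_rng(2)
mu = 1.0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    quad = f(y) + grad_f(y) @ (x - y) + 0.5 * mu * np.dot(x - y, x - y)
    # For this f the quadratic model at y is exact, so it is simultaneously a
    # lower bound (1-strong convexity) and an upper bound (1-smoothness) on f(x).
    assert abs(f(x) - quad) < 1e-9
print("f(x) = (1/2)||x||^2 satisfies both inequalities with mu = 1 (with equality)")
```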
