CS7545_Sp23_Lecture_03: Convex Analysis

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Jacob Abernethy

Notes for Lecture 03 - Convex Analysis

January 17, 2023

Scribes: Julia Grigni, Salil Kamath

Announcements

  • Scribes should ensure that a draft of the lecture notes is available within 72 hours of the lecture.
  • If you find inconsistencies between the markdown and TeX versions while scribing, note them on the appropriate wiki page.
  • The teaching team is currently deciding whether or not to accept scanned, handwritten homework solutions; if you feel strongly about this, make a post on Piazza.

Derivatives

Definition 3.1 (Gradient)
Let $X\subseteq \mathbb{R}^n$ and $f:X\rightarrow \mathbb{R}$, and assume $f$ is twice differentiable. The gradient of $f$ at $x$, written $\nabla f(x)$, is the vector of partial derivatives.
$$\nabla f(x) = \left [ \frac{\partial f}{\partial x_i} \right ]_{i=1,...,n}$$

Definition 3.2 (Hessian)
The Hessian of $f$ at $x$ is the matrix $$\nabla ^2 f(x) = \left [ \frac{\partial^2 f}{\partial x_i \partial x_j}\right ]_{1\leq i, j \leq n}$$
Note: $\nabla ^2 f(x)$ is symmetric whenever the second partial derivatives are continuous.

Theorem 3.1 (Mean Value Theorem)
For differentiable $g$ and $\forall x,y\in domain(g)$, $$g(x)-g(y)=\langle \nabla g(z), x-y \rangle$$ for some $z$ on the segment between $x$ and $y$.
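For example (my addition, as a quick one-dimensional sanity check), take $g(x) = x^2$: $$g(x) - g(y) = x^2 - y^2 = (x+y)(x-y) = g'(z)(x-y) \quad \text{with } z = \frac{x+y}{2},$$ and $z$ indeed lies between $x$ and $y$.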

Lipschitz

Definition 3.3 (c-Lipschitz)
We say a function $f$ is $c$-Lipschitz $\left ( c\geq0 \right )$ with respect to a norm $\lVert\cdot\rVert$ if $\forall x,y \in domain(f)$ $$f(x)-f(y) \leq c\lVert x-y\rVert$$

  • Intuition: $f$ is $c$-Lipschitz if it does not change too fast; the function changes by at most $c$ per unit of distance.

Examples:

  1. The sigmoid $\sigma(x)=\frac{1}{1+e^{-x}}$ is Lipschitz (in fact $\frac{1}{4}$-Lipschitz, since $|\sigma'(x)| \leq \frac{1}{4}$).
  2. $\log(x)$ is not Lipschitz on $(0,\infty)$, since its derivative $\frac{1}{x}$ is unbounded as $x \to 0^+$ (see the numerical check below).
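The following numerical sketch (my addition; it assumes NumPy and uses randomly sampled points) illustrates both examples: the sigmoid's difference quotients stay below $\frac{1}{4}$, while those of $\log$ blow up as its arguments approach $0$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100_000)
y = rng.uniform(-10, 10, size=100_000)

# Difference quotients |f(x) - f(y)| / |x - y| lower-bound the best Lipschitz constant.
sig_quot = np.abs(sigmoid(x) - sigmoid(y)) / np.abs(x - y)
print("sigmoid: max quotient =", sig_quot.max())   # stays below 1/4

# log on (0, 1]: quotients grow without bound as the points approach 0.
u = rng.uniform(1e-8, 1.0, size=100_000)
v = rng.uniform(1e-8, 1.0, size=100_000)
log_quot = np.abs(np.log(u) - np.log(v)) / np.abs(u - v)
print("log: max quotient =", log_quot.max())       # huge -> not Lipschitz
```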

Theorem 3.2
If $f$ is differentiable, then $f$ is $c$-Lipschitz with respect to $\lVert\cdot\rVert$ iff $\forall x\in domain(f)$ $$\lVert\nabla f(x)\rVert_* \leq c$$

Note: the gradient is measured in the dual norm $\lVert\cdot\rVert_*$ of the norm $\lVert\cdot\rVert$ used in the Lipschitz condition.

Proof (one direction): Assume $\forall x \in domain(f)$, $\lVert\nabla f(x)\rVert_* \leq c$. Then for any $x, y \in domain(f)$,

$$\begin{align*}f(x)-f(y)&=\nabla f(z)^\top (x-y) && \text{for some $z$ by the Mean Value Theorem} \\ &\leq \lVert\nabla f(z)\rVert_* \lVert x-y\rVert && \text{by Hölder's inequality}\\ &\leq c\lVert x-y\rVert && \text{by the assumption,} \end{align*}$$

so $f$ is $c$-Lipschitz by definition.
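As an illustration (my addition, not from the lecture), the dual-norm bound is tight for linear functions: if $f(x) = \langle a, x\rangle$, then $\nabla f(x) = a$ for every $x$, so Theorem 3.2 says $f$ is $\lVert a \rVert_*$-Lipschitz with respect to $\lVert \cdot \rVert$. Indeed, $f(x) - f(y) = \langle a, x-y\rangle \leq \lVert a\rVert_* \lVert x-y\rVert$ by Hölder's inequality, and equality is attained for a suitable choice of $x-y$, so no smaller Lipschitz constant works.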

Convexity

Definition 3.4 (Convex Set)
A set $\mathcal{U}\subseteq \mathbb{R}^n$ is convex if $\forall x,y \in \mathcal{U}$, $\frac{x+y}{2}\in \mathcal{U}$. (The midpoint of any two points in the set must also lie in the set.)
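For example (my addition), any norm ball $\mathcal{B}_r = \{ x \in \mathbb{R}^n : \lVert x \rVert \leq r \}$ is convex: for $x, y \in \mathcal{B}_r$, the triangle inequality gives $$\left\lVert \frac{x+y}{2} \right\rVert \leq \frac{\lVert x \rVert + \lVert y \rVert}{2} \leq r,$$ so the midpoint lies in $\mathcal{B}_r$.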

Definition 3.5 (Convex Function)
A function $f$ is convex if $domain(f)$ is convex and $\forall x,y \in domain(f)$ $$f\left(\frac{x+y}{2}\right)\leq \frac{f(x)+f(y)}{2}$$ That is, the chord connecting any two points on the graph must lie on or above the function.

Theorem 3.3
$f$ is convex iff $\forall x,y \in domain(f)$, $\alpha \in [0,1]$,
$$f(\alpha x +(1-\alpha) y) \leq \alpha f(x) + (1-\alpha) f(y)$$

Exercise: Prove Theorem 3.3.

Theorem 3.4
Let $f$ be differentiable. Then $f$ is convex iff $\forall x,y \in domain(f)$ $$f(y)\geq f(x) + \langle \nabla f(x), y-x\rangle$$ That is, the first-order (linear) approximation at any point is a global lower bound on $f$.
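A minimal numerical sketch of Theorem 3.4 (my addition, assuming NumPy), using the convex function $f(x) = \lVert x \rVert_2^2$ with $\nabla f(x) = 2x$: the linear approximation at $x$ never exceeds $f(y)$.

```python
import numpy as np

def f(x):
    return np.dot(x, x)        # f(x) = ||x||_2^2, a convex function

def grad_f(x):
    return 2.0 * x             # gradient of f

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.normal(size=5)
    y = rng.normal(size=5)
    lower_bound = f(x) + grad_f(x) @ (y - x)   # linear approximation of f at x, evaluated at y
    assert f(y) >= lower_bound - 1e-9          # Theorem 3.4: f(y) >= f(x) + <grad f(x), y - x>
print("first-order condition verified on 1000 random pairs")
```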

Theorem 3.5 (Jensen's Inequality)
Jensen's Inequality states that $f$ is convex if and only if, for all random variables $X$ taking values in $domain(f)$, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. (A worked example appears after the remarks below.)

  • This is an equivalent, alternate definition of convexity.
    • It is fine to think of this as the true definition of convex functions if that makes more sense to you.
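A standard consequence (my addition): applying Jensen's Inequality to the convex function $f(t) = t^2$ gives $$(\mathbb{E}[X])^2 = f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] = \mathbb{E}[X^2],$$ which is exactly the statement that $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \geq 0$.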

Theorem 3.6 (Second Order Condition)
Let $f$ be twice-differentiable. Then, $f$ is convex if and only if, $\forall x \in domain(f)$, $$\nabla ^2 f(x) \succeq 0 $$

  • The intuition here is that convexity corresponds to non-negative "curvature" in every direction, which is exactly the requirement that the Hessian be positive semi-definite (PSD).

Example:
Let $\vec{x}$ be a 2-dimensional vector, and let $f(\vec{x}) = (x_1 - x_2)^2 = x_1^2 + x_2^2 - 2x_1 x_2$.

$$\begin{align*} \nabla f(x) = \begin{bmatrix} 2x_1 - 2x_2 & 2x_2 - 2x_1 \end{bmatrix} \end{align*}$$

$$\begin{align*} \nabla^2 f(x) &= \begin{bmatrix} 2 & -2 \\ -2 & 2 \end{bmatrix}\\ \end{align*}$$

Next, we want to find the eigenvalues of $\nabla^2 f(x)$. Since $\nabla^2 f(x)$ is a $2 \times 2$ matrix with rank 1, one of the eigenvalues is 0. Furthermore, since $$\operatorname{tr}(A) = \sum_{i=1}^{n}\lambda_i$$ for any square matrix $A$, and the trace here is $4$, the other eigenvalue must be 4.

Since we have $\lambda_1 = 4$, $\lambda_2=0$, that means $\nabla^2 f(x)$ is PSD. Therefore, $f$ is convex.
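A quick numerical confirmation of this example (my addition, assuming NumPy):

```python
import numpy as np

# Hessian of f(x) = (x_1 - x_2)^2; it is the same matrix at every x.
H = np.array([[ 2.0, -2.0],
              [-2.0,  2.0]])

eigvals = np.linalg.eigvalsh(H)    # eigenvalues of a symmetric matrix, in ascending order
print(eigvals)                     # [0. 4.]
print(bool(np.all(eigvals >= 0)))  # True -> H is PSD, so f is convex
```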

Exercise (Counterintuitive PSD):

  1. Let $M$ be a PSD matrix. Is it possible that $M_{ii} < 0$?
    No: if $M_{ii} < 0$, then $\vec{e}_i^\top M \vec{e}_i = M_{ii} < 0$, which contradicts $M$ being PSD.

  2. Can a matrix with only positive entries have a negative eigenvalue?
    Yes, consider the following example:

$$\begin{bmatrix} 1 & 2 \\ 2 & 1\end{bmatrix}$$

This matrix has eigenvalues $\lambda_1=3$, $\lambda_2 = -1$, as found below:

$$\begin{align*} 0 &= \begin{vmatrix} 1 - \lambda & 2 \\ 2 & 1 - \lambda \end{vmatrix}\\ &= (1-\lambda)^2 - 4 \\ &= 1 - 2\lambda + \lambda^2 - 4 \\ &= \lambda^2 - 2\lambda - 3 \\ &= (\lambda + 1) (\lambda - 3) \text{, which yields } \lambda_1=3, \lambda_2 = -1 \end{align*}$$

Moral of the story: PSD matrices are not intuitive, and you need to prove everything.

Note: There are at least two ways to check whether a matrix is PSD (a sketch of both checks follows this list):

  1. Compute all of the eigenvalues and check that $\lambda_i \geq 0$ for all $i \in \{1, \dots, n\}$.
  2. Sylvester's Criterion: Let $M$ be an $n \times n$ Hermitian matrix. Then, $M$ is positive semi-definite if and only if all of its principal minors are nonnegative.
    • Real symmetric matrices are all Hermitian.
    • A principal minor of $M$ is the determinant of the square submatrix obtained by keeping the rows and columns indexed by some subset $S \subseteq \{1, \dots, n\}$ (equivalently, by deleting the same set of rows and columns).
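Here is a hedged sketch of both checks (my addition, assuming NumPy; the helper names `is_psd_eig` and `is_psd_minors` are my own), applied to the two matrices from this lecture:

```python
import itertools
import numpy as np

def is_psd_eig(M, tol=1e-10):
    """Check 1: all eigenvalues of the symmetric matrix M are nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def is_psd_minors(M, tol=1e-10):
    """Check 2 (Sylvester-type criterion for PSD): every principal minor,
    i.e. the determinant of the submatrix keeping rows and columns in the
    same index set S, is nonnegative."""
    n = M.shape[0]
    for k in range(1, n + 1):
        for S in itertools.combinations(range(n), k):
            idx = np.array(S)
            if np.linalg.det(M[np.ix_(idx, idx)]) < -tol:
                return False
    return True

H = np.array([[2.0, -2.0], [-2.0, 2.0]])   # Hessian from the convexity example
A = np.array([[1.0, 2.0], [2.0, 1.0]])     # positive entries, but has eigenvalue -1

for name, M in [("H", H), ("A", A)]:
    print(name, is_psd_eig(M), is_psd_minors(M))
# Expected output: H True True / A False False
```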

Strong Convexity and Smoothness

Definition 3.6 ($\mu$-strongly convex)
A differentiable function $f$ is $\mu$-strongly convex with respect to norm $\lVert \cdot \rVert$ if, $\forall x, y \in domain(f)$, $$f(x) \geq f(y) + \langle\nabla f(y), x-y\rangle + \frac{\mu}{2} \lVert y-x \rVert ^2 $$

  • Intuition, compared with plain convexity:
    • With convexity, the function cannot curve downward; it can only curve upward (or stay flat).
    • With strong convexity, the function must curve upward by at least a quadratic amount governed by $\mu$.

Definition 3.7 ($\mu$-smooth)
A differentiable function $f$ is $\mu$-smooth with respect to norm $\lVert \cdot \rVert$ if, $\forall x, y \in domain(f)$, $$f(x) \leq f(y) + \langle\nabla f(y), x - y\rangle + \frac{\mu}{2} \lVert y-x \rVert ^2 $$

Lemma 3.1 (Shifting eigenvalues)
Let $\alpha \in \mathbb{R}$, and let $M$ be a PSD matrix. Denote the eigenvalues of a matrix M as $eigs(M)$, in the form of a row vector. Then,

$$eigs(M-\alpha I) = eigs(M) - \begin{bmatrix} \alpha & \alpha & ... & \alpha \end{bmatrix} $$
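A quick numerical illustration of Lemma 3.1 (my addition, assuming NumPy), using the Hessian from the convexity example and $\alpha = 1$:

```python
import numpy as np

M = np.array([[ 2.0, -2.0],
              [-2.0,  2.0]])                         # PSD, eigenvalues {0, 4}
alpha = 1.0

original = np.linalg.eigvalsh(M)                     # [0. 4.]
shifted = np.linalg.eigvalsh(M - alpha * np.eye(2))  # [-1. 3.]
print(original, shifted)
print(bool(np.allclose(shifted, original - alpha)))  # True: eigenvalues shift by alpha
```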

Note ($\mu$-strong convexity, $\mu$-smoothness, and twice differentiability): If $f$ is twice differentiable, then

  1. $f$ is $\mu$-strongly convex w.r.t. the 2-norm if and only if, $\forall x \in domain(f)$, $\nabla ^2 f(x) \succeq \mu I $, or equivalently, $\nabla ^2 f(x) - \mu I \succeq 0 $
    • By Lemma 3.1, this is equivalent to the statement that all the eigenvalues of $\nabla ^2 f(x)$ are at least $\mu$.
  2. $f$ is $\mu$-smooth w.r.t. the 2-norm if and only if, $\forall x \in domain(f)$, $\nabla ^2 f(x) \preceq \mu I $

Note: Smoothness does not depend on convexity; a function can be smooth, strongly convex, both, or neither. For example, $f(x)= \frac{1}{2} \lVert x \rVert ^2$ is both $1$-smooth and $1$-strongly convex.
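A short sketch of this example (my addition, assuming NumPy): for $f(x) = \frac{1}{2}\lVert x\rVert^2$ the Hessian is the identity, so both conditions in the note above hold with $\mu = 1$; the code checks the two defining inequalities directly on random points, where they in fact hold with equality.

```python
import numpy as np

def f(x):
    return 0.5 * np.dot(x, x)   # f(x) = (1/2) ||x||_2^2

def grad_f(x):
    return x                    # gradient of f

rng = np.random.default_rng(2)
mu = 1.0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    quad = f(y) + grad_f(y) @ (x - y) + 0.5 * mu * np.dot(x - y, x - y)
    # For this f the quadratic model at y is exact, so it is simultaneously a
    # lower bound (1-strong convexity) and an upper bound (1-smoothness) on f(x).
    assert abs(f(x) - quad) < 1e-9
print("f(x) = (1/2)||x||^2 satisfies both inequalities with mu = 1 (with equality)")
```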
