CS7545_Sp23_Lecture_04: Convex Analysis Continued and Statistics Review

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Jacob Abernethy

Notes for Lecture 04

January 19, 2023

Scribes: Suvan Adhikari, Sooraj Karthik

Announcements

  • Homework must be typed, but using LaTeX is optional.
  • Office hours will be posted soon. They will be held primarily virtually, but in-person office hours are possible.

Fenchel Conjugates

Definition 4.1 (Fenchel Conjugate)
Given a convex function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, its Fenchel conjugate, or dual, $f^*$ is defined as:

$$f^*(\theta) = \sup_{x\in dom(f)}(\langle x, \theta \rangle - f(x))$$

  • Intuition: In a strictly convex function $f$, there is a unique mapping $x \leftrightarrow \nabla f(x)$. The Fenchel conjugate maps from the gradient back to the input of the function.
    • Note that a strictly convex function is one where the line segment connecting any two points on its graph lies strictly above the graph (except at the endpoints). Intuitively, the function has no "flat" regions; it is "curved up a little bit everywhere."
    • Mathematically, this means for all $\alpha\in(0,1)$ and all distinct $x,y\in dom(f)$:

$$f(\alpha x + (1 - \alpha) y) < \alpha f(x) + (1 - \alpha)f(y)$$

  • Note that $f^*$ may have a different domain than $f$ so we use $\theta$ as the argument for $f^*$ rather than $x$ to avoid any confusion.

Examples:

  1. The dual of $f(x) = \frac{1}{2}(||x||_2)^2$ is $f^*(\theta) = \frac{1}{2}(||\theta||_2)^2$ (Try deriving this as an exercise).
  2. The dual of $f(x) = \frac{1}{2}x^TMx$, where $M$ is positive definite, is $f^*(\theta) = \frac{1}{2}\theta^TM^{-1}\theta$.

Proof:
For positive definite matrix M,

$$\begin{align*} f^*(\theta) &= \sup_{x\in dom(f)}(\langle x,\theta\rangle - f(x)) & &\text{definition of Fenchel conjugate} \\ &= \sup_{x\in dom(f)}(\langle x,\theta\rangle - \frac{1}{2}x^TMx) & &\text{substituting value for f(x)} \\ \end{align*}$$

To solve for the supremum, let $\Phi(x) = \langle x,\theta\rangle - \frac{1}{2}x^TMx$ and find where $\nabla_x\Phi(x)=0$: $$\nabla_x\Phi(x)=\theta-Mx=0 \Rightarrow\theta=Mx\Rightarrow x = M^{-1}\theta \text{\ \ \ (all positive definite matrices are invertible)}$$ Since $M$ is positive definite, $\Phi$ is strictly concave, so this stationary point is the unique maximizer.

Here we used $\nabla_x\langle x,\theta\rangle = \theta$ and $\nabla_x\left(\frac{1}{2}x^TMx\right) = Mx$ for symmetric $M$.

Now we can just substitute the value we got for $x$ back into our original equation:

$$ \begin{align*} f^*(\theta)&=\langle M^{-1}\theta,\theta\rangle - \frac{1}{2}(M^{-1}\theta)^TM(M^{-1}\theta) & &\text{substitute $x=M^{-1}\theta$ into original equation} \\ f^*(\theta)&=\theta^T(M^{-1})^T\theta - \frac{1}{2}\theta^T(M^{-1})^TMM^{-1}\theta & &\text{use $\langle a,b\rangle=a^Tb$ and $(Ab)^T=b^TA^T$} \\ f^*(\theta)&=\theta^TM^{-1}\theta - \frac{1}{2}\theta^TM^{-1}\theta & &\text{inverse of PD matrix is PD so $(M^{-1})^T=M^{-1}$} \\ f^*(\theta)&= \frac{1}{2}\theta^TM^{-1}\theta \end{align*} $$
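The closed-form conjugate above can be sanity-checked numerically. The sketch below (not part of the original notes) uses an arbitrarily chosen positive definite $M$, computes the maximizer of $\Phi$ from the stationarity condition $Mx = \theta$, and compares $\Phi$ at that point against $\frac{1}{2}\theta^TM^{-1}\theta$:

```python
import numpy as np

# Arbitrary (hypothetical) positive definite matrix and dual point
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
theta = np.array([1.0, -2.0])

# Objective inside the supremum: Phi(x) = <x, theta> - 1/2 x^T M x
Phi = lambda x: x @ theta - 0.5 * x @ M @ x

# Stationarity condition grad Phi = 0 gives M x = theta
x_star = np.linalg.solve(M, theta)

# Closed-form conjugate from the proof: f*(theta) = 1/2 theta^T M^{-1} theta
f_star = 0.5 * theta @ np.linalg.solve(M, theta)

print(np.isclose(Phi(x_star), f_star))   # the two values agree
print(Phi(x_star) >= Phi(x_star + 0.1))  # perturbed points do no better
```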

Facts about Fenchel Conjugates

  • If $f$ is closed, then $(f^*)^* = f$.
  • If $f$ is differentiable and strictly convex, then $\nabla f$ and $\nabla f^*$ are inverse maps. That is, $\nabla f^*(\nabla f(x)) = x$ and $\nabla f(\nabla f^*(\theta)) = \theta$ (think about how this relates to the intuition mentioned earlier).
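The inverse-map fact can be checked concretely for the quadratic example above, where $\nabla f(x) = Mx$ and $\nabla f^*(\theta) = M^{-1}\theta$. A minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical positive definite M; f(x) = 1/2 x^T M x is strictly convex
M = np.array([[3.0, 1.0],
              [1.0, 2.0]])
x = np.array([0.5, -1.5])
theta = np.array([1.0, 2.0])

grad_f = lambda v: M @ v                       # gradient of f
grad_f_star = lambda t: np.linalg.solve(M, t)  # gradient of f*

# The two gradient maps invert each other
print(np.allclose(grad_f_star(grad_f(x)), x))          # True
print(np.allclose(grad_f(grad_f_star(theta)), theta))  # True
```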

Theorem 4.1 (Fenchel-Young Inequality) Given a convex function $f$, for all $x\in dom(f)$ and all $\theta$, $f(x) + f^*(\theta) \geq \langle x, \theta \rangle$.

Proof:

$$ \begin{align*} f(x)+f^*(\theta) &= f(x)+\sup_{x'\in dom(f)}(\langle x',\theta\rangle-f(x')) & &\text{definition of Fenchel conjugate} \\ &\geq f(x)+\langle x,\theta\rangle -f(x) & &\text{the supremum is at least the value at $x'=x$} \\ &= \langle x,\theta \rangle \end{align*} $$

This can actually be used to prove Young's inequality, since the Fenchel conjugate of $\frac{1}{p}(\lVert\cdot\rVert_p)^p$ is $\frac{1}{q}(\lVert\cdot\rVert_q)^q$ when $\frac{1}{p}+\frac{1}{q}=1$ with $p,q>1$.
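As a quick spot-check (not part of the original notes), Fenchel-Young can be verified for Example 1, $f(x) = \frac{1}{2}\lVert x\rVert_2^2$, whose conjugate is $f^*(\theta) = \frac{1}{2}\lVert\theta\rVert_2^2$; the gap is exactly $\frac{1}{2}\lVert x-\theta\rVert_2^2 \geq 0$:

```python
import numpy as np

# f(x) = 1/2 ||x||_2^2 is self-conjugate, so Fenchel-Young reads
# 1/2 ||x||^2 + 1/2 ||theta||^2 >= <x, theta>   (i.e. 1/2 ||x - theta||^2 >= 0)
rng = np.random.default_rng(0)
f = lambda v: 0.5 * v @ v

holds = True
for _ in range(1000):
    x, theta = rng.normal(size=5), rng.normal(size=5)
    holds &= f(x) + f(theta) >= x @ theta
print(holds)  # True
```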

Bregman Divergence

Definition 4.2 (Bregman Divergence)
For a convex, differentiable function $f$, the Bregman Divergence of $f$ between $x,y\in dom(f)$ is defined as:

$$D_f(x,y) = f(x)-f(y)-\langle\nabla f(y),x-y \rangle$$

  • Intuition: $D_f(x,y)$ measures how far $f(x)$ lies above the linear approximation of $f$ at $x$ built from the value and gradient at $y$.
    • This is easily seen in the following:

$$\begin{align*} D_f(x,y) &= f(x)-f(y)-\langle\nabla f(y),x-y \rangle \\ &=f(x)- (f(y)+\langle\nabla f(y),x-y \rangle) \end{align*}$$

  • Note that $f(y) + \langle \nabla f(y), x - y \rangle$ is the linear approximation of $f(x)$ based off values at $y$.

Examples:

  • If $f(x) = \frac{1}{2}(||x||_2)^2$, then $D_f(x,y) = \frac{1}{2}(\lVert x - y\rVert_2)^2$.
  • If $f(p) = \sum p_i\log p_i$ (the negative entropy function), then for probability vectors $p,q$, $D_f(p,q) = \sum p_i\log\frac{p_i}{q_i}$ (the KL divergence).
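Both examples can be checked directly from Definition 4.2. The sketch below (with hypothetical probability vectors) evaluates $f(x)-f(y)-\langle\nabla f(y),x-y\rangle$ and compares against the stated closed forms:

```python
import numpy as np

# Hypothetical probability vectors (both sum to 1)
x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])

# Example 1: f(v) = 1/2 ||v||^2, grad f(v) = v  ->  D_f(x, y) = 1/2 ||x - y||^2
f = lambda v: 0.5 * v @ v
grad_f = lambda v: v
D = f(x) - f(y) - grad_f(y) @ (x - y)
print(np.isclose(D, 0.5 * np.sum((x - y) ** 2)))  # True

# Example 2: f(p) = sum p_i log p_i, grad f(p)_i = log p_i + 1
# For probability vectors, D_f(p, q) = KL(p || q)
neg_ent = lambda p: np.sum(p * np.log(p))
grad_ne = lambda p: np.log(p) + 1.0
D_kl = neg_ent(x) - neg_ent(y) - grad_ne(y) @ (x - y)
print(np.isclose(D_kl, np.sum(x * np.log(x / y))))  # True
```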

Prob/Stats Review

Definition 4.3 (Random Variable)

A random variable $X$ is a function $X: \Omega \rightarrow \mathbb{R}$ from some sample space $\Omega$ to the real numbers.

Definition 4.4 (Cumulative Distribution Function)

Given a random variable $X$, the cumulative distribution function (CDF) of $X$ is defined as $F(x)=Pr[X\leq x]$.

Definition 4.5 (Probability Density Function) Given a random variable $X$ with a differentiable CDF $F(x)$, then the probability density function (PDF) of $X$ is defined as $f(x)=F'(x)$. Note it is possible to have a random variable without a PDF (when the CDF is not differentiable).

Definition 4.6 (Expectation)

Given a random variable $X$, we define the expectation $\mathbb{E}[X]$ as follows: $$\mathbb{E}[X]=\int x\,dF=\int x f(x)\,dx$$

where $f$ is the PDF of $X$, so $dF=f(x)\,dx$.
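To make the definition concrete, here is a small (hypothetical) Monte Carlo illustration: the empirical average of i.i.d. samples approximates $\mathbb{E}[X]$. For $X$ exponential with rate 1, $F(x)=1-e^{-x}$, $f(x)=e^{-x}$, and $\mathbb{E}[X]=\int_0^\infty x e^{-x}\,dx = 1$:

```python
import numpy as np

# Draw i.i.d. samples from Exponential(rate=1), whose true mean is 1
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=1_000_000)

# The sample mean should be close to E[X] = 1
print(abs(samples.mean() - 1.0) < 0.01)
```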

Definition 4.7 (Independence)

We call two random variables $X,Y$ independent when $Pr[X\in A \wedge Y\in B]=Pr[X\in A]Pr[Y\in B]$ for all measurable $A,B \subseteq \mathbb{R}$.

Fact: When two random variables $X$ and $Y$ are independent, $\mathbb{E}[XY]=\mathbb{E}[X]\,\mathbb{E}[Y]$.
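This fact can be illustrated by sampling (a hypothetical check, not part of the notes): for independent $X, Y \sim \text{Uniform}(0,1)$, $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] = \frac{1}{4}$, and the empirical versions agree up to sampling noise:

```python
import numpy as np

# Independent uniform samples; E[X] = E[Y] = 1/2, so E[XY] should be 1/4
rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)
y = rng.uniform(size=1_000_000)

# Empirical E[XY] matches empirical E[X] E[Y] up to Monte Carlo error
print(abs((x * y).mean() - x.mean() * y.mean()) < 0.005)
```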
