# CS7545_Sp23_Lecture_04: Convex Analysis Continued and Statistics Review
- Homework must be typed, but using LaTeX is optional.
- Office hours will be posted soon. They will be held primarily online, but in-person office hours are possible.
Definition 4.1 (Fenchel Conjugate)
Given a convex function $f$, its Fenchel conjugate $f^*$ is defined as
$$f^*(\theta) = \sup_{x \in \text{dom}(f)} \left[ \langle \theta, x \rangle - f(x) \right]$$
- Intuition: For a strictly convex function $f$, there is a unique correspondence $x \leftrightarrow \nabla f(x)$. The Fenchel conjugate maps from the gradient back to the input of the function.
- Note that a strictly convex function is one where the line segment connecting any two points on the graph of the function lies strictly above the graph (except at its endpoints). You can think of this as meaning that the function has no "flat" regions, or that it is "curved up a little bit everywhere."
  - Mathematically, this means that for all $\alpha \in (0,1)$ and all $x, y \in \text{dom}(f)$ with $x \neq y$:
    $$f(\alpha x + (1-\alpha)y) < \alpha f(x) + (1-\alpha)f(y)$$
- Note that $f^*$ may have a different domain than $f$, so we use $\theta$ as the argument for $f^*$ rather than $x$ to avoid any confusion.
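To make the supremum concrete, here is a short worked one-dimensional example (our addition, not from the lecture): take $f(x) = e^x$. Then
$$f^*(\theta) = \sup_x \left[\theta x - e^x\right].$$
Setting the derivative to zero gives $\theta - e^x = 0$, i.e. $x = \ln\theta$, which is only valid for $\theta > 0$; substituting back yields $f^*(\theta) = \theta\ln\theta - \theta$. Note how the domain of $f^*$ differs from that of $f$, illustrating the remark above.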
Examples:
- The dual of $f(x) = \frac{1}{2}\lVert x\rVert_2^2$ is $f^*(\theta) = \frac{1}{2}\lVert\theta\rVert_2^2$ (try deriving this as an exercise).
- The dual of $f(x) = \frac{1}{2}x^TMx$, where $M$ is positive definite (so that $M^{-1}$ exists), is $f^*(\theta) = \frac{1}{2}\theta^TM^{-1}\theta$.
Proof:
For a positive definite matrix $M$,
$$f^*(\theta) = \sup_x \left[ \langle \theta, x \rangle - \frac{1}{2}x^TMx \right].$$
To solve for the supremum, we let the gradient of the (concave) objective equal zero:
$$\nabla_x\left(\theta^Tx - \frac{1}{2}x^TMx\right) = \theta - Mx = 0 \quad\Longrightarrow\quad x = M^{-1}\theta.$$
Proof for the gradient of $\frac{1}{2}x^TMx$ being $Mx$: write $\frac{1}{2}x^TMx = \frac{1}{2}\sum_{i,j}M_{ij}x_ix_j$; differentiating with respect to $x_k$ gives $\frac{1}{2}\sum_j M_{kj}x_j + \frac{1}{2}\sum_i M_{ik}x_i = (Mx)_k$, using the symmetry of $M$.
Now we can just substitute the value we got for $x$ back in:
$$f^*(\theta) = \theta^TM^{-1}\theta - \frac{1}{2}(M^{-1}\theta)^TM(M^{-1}\theta) = \theta^TM^{-1}\theta - \frac{1}{2}\theta^TM^{-1}\theta = \frac{1}{2}\theta^TM^{-1}\theta. \qquad \blacksquare$$
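The closed form can also be sanity-checked numerically. Below is a minimal sketch (our addition, not from the lecture; all variable names are our own) that approximates the supremum with `scipy` and compares it against $\frac{1}{2}\theta^TM^{-1}\theta$:

```python
import numpy as np
from scipy.optimize import minimize

# Approximate f*(theta) = sup_x [<theta, x> - 1/2 x^T M x] numerically
# and compare it to the closed form 1/2 theta^T M^{-1} theta.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
M = A @ A.T + 3 * np.eye(3)        # a random positive definite matrix
theta = rng.standard_normal(3)

# Maximizing <theta, x> - f(x) is minimizing f(x) - <theta, x>.
res = minimize(lambda x: 0.5 * x @ M @ x - theta @ x, x0=np.zeros(3))
numerical = -res.fun
closed_form = 0.5 * theta @ np.linalg.solve(M, theta)

print(numerical, closed_form)      # the two values should agree closely
```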
Facts about Fenchel Conjugates
- If $f$ is closed, then $(f^*)^* = f$.
- If $f$ is differentiable and strictly convex, then $\nabla f$ and $\nabla f^*$ are inverse maps. That is, $\nabla f^*(\nabla f(x)) = x$ and $\nabla f(\nabla f^*(\theta)) = \theta$ (think about how this relates to the intuition mentioned earlier).
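As a quick check of the inverse-map fact on the quadratic example above (our addition): for $f(x) = \frac{1}{2}x^TMx$ with $M$ positive definite, we have $\nabla f(x) = Mx$ and $\nabla f^*(\theta) = M^{-1}\theta$, so
$$\nabla f^*(\nabla f(x)) = M^{-1}(Mx) = x.$$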
Theorem 4.1 (Fenchel-Young Inequality) Given a convex function $f$ and its Fenchel conjugate $f^*$, for all $x \in \text{dom}(f)$ and all $\theta$:
$$f(x) + f^*(\theta) \geq \langle \theta, x \rangle$$
Proof:
By definition, $f^*(\theta) = \sup_{x'} \left[ \langle \theta, x' \rangle - f(x') \right] \geq \langle \theta, x \rangle - f(x)$ for any particular $x$. Rearranging gives $f(x) + f^*(\theta) \geq \langle \theta, x \rangle$. $\blacksquare$
This can actually be used to prove the result of Young's inequality, $ab \leq \frac{a^p}{p} + \frac{b^q}{q}$ for $a, b \geq 0$ and $\frac{1}{p} + \frac{1}{q} = 1$ with $p, q > 1$, since the Fenchel conjugate of $f(x) = \frac{x^p}{p}$ (on $x \geq 0$) is $f^*(\theta) = \frac{\theta^q}{q}$.
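To see why (a short derivation we are adding for completeness): for $\theta \geq 0$, maximize $\theta x - \frac{x^p}{p}$ over $x \geq 0$. Setting the derivative to zero gives $\theta = x^{p-1}$, i.e. $x = \theta^{1/(p-1)}$, and substituting back:
$$f^*(\theta) = \theta \cdot \theta^{1/(p-1)} - \frac{\theta^{p/(p-1)}}{p} = \theta^{q}\left(1 - \frac{1}{p}\right) = \frac{\theta^q}{q},$$
using $q = \frac{p}{p-1}$. Fenchel-Young then reads exactly $ab \leq \frac{a^p}{p} + \frac{b^q}{q}$.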
Definition 4.2 (Bregman Divergence)
For a convex, differentiable function $f$, the Bregman divergence is defined as
$$D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle.$$
- Intuition: $D_f(x,y)$ represents how far off $f(x)$ is from the linear approximation of $f(x)$ built using the gradient and value of $f$ at $y$.
  - This is easily seen in the following regrouping of the definition:
    $$D_f(x, y) = f(x) - \left[ f(y) + \langle \nabla f(y), x - y \rangle \right]$$
  - Note that $f(y) + \langle \nabla f(y), x - y \rangle$ is the first-order (linear) approximation of $f(x)$ based off values at $y$.
Examples:
- If $f(x) = \frac{1}{2}\lVert x\rVert_2^2$, then $D_f(x,y) = \frac{1}{2}\lVert x - y\rVert_2^2$.
- If $f(p) = \sum_i p_i\log p_i$ (the negative entropy function), then $D_f(p,q) = \sum_i p_i\log\frac{p_i}{q_i}$ (the KL divergence).
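Both examples are easy to verify numerically. Here is a small sketch (our addition; the helper `bregman` and all names are ours) that evaluates $D_f$ directly from the definition, supplying each gradient in closed form:

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

rng = np.random.default_rng(1)
x, y = rng.random(4), rng.random(4)

# Example 1: f = (1/2)||.||^2 gives the squared Euclidean distance.
sq = bregman(lambda v: 0.5 * v @ v, lambda v: v, x, y)
print(sq, 0.5 * np.sum((x - y) ** 2))          # should match

# Example 2: f = negative entropy gives the KL divergence
# (x, y are normalized into probability vectors p, q first).
p, q = x / x.sum(), y / y.sum()
neg_ent = lambda v: np.sum(v * np.log(v))
grad_ne = lambda v: np.log(v) + 1.0
kl = bregman(neg_ent, grad_ne, p, q)
print(kl, np.sum(p * np.log(p / q)))           # should match
```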
Definition 4.3 (Random Variable)
A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that assigns a real number to each outcome in a sample space $\Omega$.
Definition 4.4 (Cumulative Distribution Function)
Given a random variable $X$, its cumulative distribution function (CDF) is the function $F_X(t) = \Pr[X \leq t]$.
Definition 4.5 (Probability Density Function)
Given a random variable $X$ whose CDF $F_X$ is differentiable, its probability density function (PDF) is $f_X(t) = \frac{d}{dt}F_X(t)$; equivalently, $\Pr[a \leq X \leq b] = \int_a^b f_X(t)\,dt$.
Definition 4.6 (Expectation)
Given a random variable $X$ with PDF $f_X$, its expectation is
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} t\, f_X(t)\, dt,$$
where, for a discrete random variable, the integral is replaced by a sum over the support of $X$.
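As a quick worked instance of the last three definitions (our addition): if $X$ is uniform on $[0,1]$, then $F_X(t) = t$ and $f_X(t) = 1$ for $t \in [0,1]$, and
$$\mathbb{E}[X] = \int_0^1 t \cdot 1 \, dt = \frac{1}{2}.$$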
Definition 4.7 (Independence)
We call two random variables $X$ and $Y$ independent if, for all $a$ and $b$,
$$\Pr[X \leq a,\ Y \leq b] = \Pr[X \leq a]\cdot\Pr[Y \leq b].$$
Fact: When two random variables $X$ and $Y$ are independent, $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.
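A quick simulation illustrates this fact (a sketch we are adding, not from the lecture; the distributions chosen are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# X and Y are drawn independently, so E[XY] should equal E[X] * E[Y].
X = rng.exponential(scale=2.0, size=n)
Y = rng.normal(loc=1.0, scale=3.0, size=n)
print(np.mean(X * Y), np.mean(X) * np.mean(Y))   # approximately equal

# For contrast, X is perfectly dependent on itself, and the identity fails:
print(np.mean(X * X), np.mean(X) ** 2)           # these differ by Var(X) > 0
```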