CS7545_Sp24_Lecture_10 - mltheory/CS7545 GitHub Wiki
CS 7545: Machine Learning Theory -- Spring 2024
Instructor: Tyler LaBonte
Notes for Lecture 10
February 15, 2024
Scribes: Meena Nagarajan, Caleb McFarland
Setup
$\mathcal{X}$ is an input space. $\mathcal{D}$ is a distribution on $\mathcal{X}$. Our sample is $S = \{x_1, \ldots, x_m\}$ drawn i.i.d. from $\mathcal{D}$. $\mathcal{F} \subseteq \{f: \mathcal{X} \rightarrow \mathbb{R}\}$ is the class of functions. $\sigma = (\sigma_1, \ldots, \sigma_m)$ is a vector of Rademacher random variables, so each $\sigma_i \in \{-1, 1\}$ is chosen independently with equal probability. We have the following definitions of the Empirical Rademacher Complexity and Rademacher Complexity, respectively:

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right], \qquad \mathfrak{R}(\mathcal{F}) = \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\widehat{\mathfrak{R}}_S(\mathcal{F})\right].$$
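As a quick numerical illustration (not part of the lecture), the empirical Rademacher complexity of a *finite* class can be estimated by Monte Carlo, representing each $f \in \mathcal{F}$ by its vector of values $(f(x_1), \ldots, f(x_m))$ on the sample. The class below is hypothetical.

```python
import numpy as np

def empirical_rademacher(F, num_trials=10_000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    F is a (k, m) array whose row j holds (f_j(x_1), ..., f_j(x_m)),
    i.e., the j-th function of a finite class evaluated on the sample S.
    Estimates E_sigma[ sup_f (1/m) * sum_i sigma_i * f(x_i) ].
    """
    rng = np.random.default_rng(rng)
    k, m = F.shape
    total = 0.0
    for _ in range(num_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
        total += np.max(F @ sigma) / m           # sup over the finite class
    return total / num_trials

# Hypothetical finite class: 5 functions evaluated on an m = 50 point sample.
F = np.random.default_rng(0).uniform(0.0, 1.0, size=(5, 50))
r = empirical_rademacher(F, rng=1)
print(f"estimated empirical Rademacher complexity: {r:.3f}")
```

Averaging this estimate over fresh draws of $S$ would approximate the (non-empirical) Rademacher complexity $\mathfrak{R}(\mathcal{F})$.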
We want to use the Rademacher Complexity to bound the uniform deviations $\underset{h \in \mathcal{H}}{\sup}|L(h) - \widehat{L}_S(h)|$ when $|\mathcal{H}| = \infty$. We will generalize the uniform deviations to functions $f: \mathcal{X} \rightarrow \mathbb{R}$ using the definitions below and get the desired bound as a corollary:

$$L(f) = \underset{x \sim \mathcal{D}}{\mathbb{E}}[f(x)], \qquad \widehat{L}_S(f) = \frac{1}{m}\sum_{i=1}^m f(x_i).$$
Here you can think of $L(f)$ as the true mean and $\widehat{L}_S(f)$ as the sample mean of $f$. This is in fact a generalization of $L(h)$: define $\mathcal{X}' = \mathcal{X} \times \{-1, 1\}$, so that you receive $z = (x, y)$ from $\mathcal{X}'$, and set $f(z) = \mathbb{1}(h(x) \neq y)$. In this framework we have that $L(f) = L(h)$. We will ignore the absolute value in the uniform deviations for now.
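A small sketch of the reduction above (with made-up data): wrapping a classifier $h$ into $f(z) = \mathbb{1}(h(x) \neq y)$ makes the sample mean of $f$ equal the empirical 0-1 loss of $h$.

```python
import numpy as np

# Hypothetical classifier h: sign threshold at 0.
h = lambda x: 1 if x >= 0 else -1

# f acts on z = (x, y) in X' = X x {-1, 1} and returns the 0-1 loss indicator.
f = lambda z: int(h(z[0]) != z[1])

# Made-up labeled sample drawn from X'.
S = [(-1.2, -1), (0.4, 1), (2.0, -1), (-0.3, 1), (1.1, 1)]

L_hat_f = np.mean([f(z) for z in S])                 # sample mean of f
L_hat_h = np.mean([int(h(x) != y) for (x, y) in S])  # empirical 0-1 loss of h
print(L_hat_f, L_hat_h)  # identical by construction
```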
Symmetrization Theorem
Suppose $\mathcal{F} \subseteq \{f: \mathcal{X} \rightarrow \mathbb{R}\}$. Then $\displaystyle\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}L(f) - \widehat{L}_S(f)\right] \leq 2\mathfrak{R}(\mathcal{F})$. Note that there is a close lower bound proved in the homework, so this shows that studying uniform deviations is essentially the same as studying Rademacher complexity.
Proof:
By definition, $\displaystyle L(f) = \underset{S \sim \mathcal{D}^m}{\mathbb{E}}[\widehat{L}_S(f)]$. Our first trick is to introduce a "ghost sample" $S' = \{x'_1, \ldots, x'_m\}$ drawn i.i.d. from $\mathcal{D}$ and write the following:

$$\underset{S}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, L(f) - \widehat{L}_S(f)\right] = \underset{S}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \underset{S'}{\mathbb{E}}\left[\widehat{L}_{S'}(f) - \widehat{L}_S(f)\right]\right] \leq \underset{S, S'}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \widehat{L}_{S'}(f) - \widehat{L}_S(f)\right]$$

because $S$ and $S'$ are independent (so $\widehat{L}_S(f)$ can be moved inside the expectation over $S'$), and the supremum of an expectation is at most the expectation of the supremum. Next, we introduce the Rademacher random variables:

$$= \underset{S, S', \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i \left(f(x'_i) - f(x_i)\right)\right].$$

Why is this true? If $\sigma_i = 1$ then the summand $f(x'_i) - f(x_i)$ stays the same. If $\sigma_i = -1$ then the summand becomes $f(x_i) - f(x'_i)$. We're taking the expectation over both $S$ and $S'$, so we are integrating over all draws of $S, S' \sim \mathcal{D}^m$. For fixed $\{z_1, \ldots, z_m\}$ and $\{z'_1, \ldots, z'_m\}$, it is equally likely that $S = \{z_1, \ldots, z_m\}$ while $S' = \{z'_1, \ldots, z'_m\}$ as it is that $S = \{z'_1, \ldots, z'_m\}$ and $S' = \{z_1, \ldots, z_m\}$. This means that the integral over all $S, S'$ is symmetric, and all $\sigma_i$ does is change the order of that summation. Subadditivity of the supremum states that the supremum of a sum is at most the sum of the suprema, so we proceed as follows:

$$\leq \underset{S', \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x'_i)\right] + \underset{S, \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m (-\sigma_i) f(x_i)\right]$$

$$= \underset{S, \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] + \underset{S, \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = 2\mathfrak{R}(\mathcal{F})$$

because $S$ and $S'$ have the same distribution and $\sigma_i$ and $-\sigma_i$ have the same distribution. This completes the proof.
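As a sanity check on the theorem (a Monte Carlo sketch with a made-up finite class on a small finite domain, not part of the lecture), we can estimate both sides:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 8, 10, 25            # |F|, |X|, sample size (all made up)
F = rng.uniform(0, 1, (k, n))  # F[j, x] = f_j(x) on the domain X = {0, ..., n-1}
true_means = F.mean(axis=1)    # L(f) under the uniform distribution on X

trials = 2000
sup_dev, rad = 0.0, 0.0
for _ in range(trials):
    S = rng.integers(0, n, size=m)      # i.i.d. sample from uniform D
    emp_means = F[:, S].mean(axis=1)    # L_hat_S(f) for every f in F
    sup_dev += np.max(true_means - emp_means)
    sigma = rng.choice([-1.0, 1.0], size=m)
    rad += np.max(F[:, S] @ sigma) / m  # one draw of the Rademacher average
sup_dev /= trials
rad /= trials
print(f"E[sup L - L_hat] ~ {sup_dev:.3f}  <=  2 R(F) ~ {2 * rad:.3f}")
```

With these parameters the left side comes out well under $2\mathfrak{R}(\mathcal{F})$, consistent with the bound.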
Generalization Bound
Suppose $\mathcal{F} \subseteq \{f: \mathcal{X} \rightarrow [0, 1]\}$.
Then for any $\delta > 0$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, for all $f \in \mathcal{F}$, we have:

$$L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log 1/\delta}{2m}}.$$
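A quick numeric sketch (with hypothetical values for the empirical loss and Rademacher complexity) of how the bound $L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\log(1/\delta)/(2m)}$ tightens as $m$ grows:

```python
import math

def generalization_bound(emp_loss, rademacher, m, delta):
    """Upper bound on L(f): L_hat + 2 R(F) + sqrt(log(1/delta) / (2m))."""
    return emp_loss + 2 * rademacher + math.sqrt(math.log(1 / delta) / (2 * m))

# Hypothetical values: empirical loss 0.10, Rademacher complexity 0.05.
for m in (100, 1000, 10000):
    b = generalization_bound(0.10, 0.05, m, delta=0.05)
    print(f"m = {m:6d}: L(f) <= {b:.3f}")
```

Note that the confidence term decays like $1/\sqrt{m}$, while the $2\mathfrak{R}(\mathcal{F})$ term depends on the class (here it is held fixed for illustration).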
We introduce a new concentration bound called McDiarmid’s Inequality.
Suppose $g: \mathcal{X}^m \rightarrow \mathbb{R}$, i.e., $g(S)$ is a real number, and there exist $C_1, \ldots, C_m > 0$ such that for all $x_1, \ldots, x_m, x'_i \in \mathcal{X}$,

$$\left|g(x_1, \ldots, x_i, \ldots, x_m) - g(x_1, \ldots, x'_i, \ldots, x_m)\right| \leq C_i.$$

Then for all $\epsilon > 0$,

$$\underset{S \sim \mathcal{D}^m}{\mathbb{P}}\left[g(S) - \underset{S \sim \mathcal{D}^m}{\mathbb{E}}[g(S)] \geq \epsilon\right] \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^m C_i^2}\right).$$
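As an illustration (a sketch, not from the lecture): for $g(S)$ equal to the sample mean of $m$ variables in $[0, 1]$, each coordinate has bounded difference $C_i = 1/m$, so McDiarmid's Inequality gives $\mathbb{P}[g(S) - \mathbb{E}[g(S)] \geq \epsilon] \leq \exp(-2m\epsilon^2)$. We can compare this with the empirical tail:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, eps = 50, 20_000, 0.1

# g(S) = sample mean of m uniform [0, 1] variables; changing one
# coordinate moves g by at most C_i = 1/m, so sum_i C_i^2 = 1/m.
samples = rng.uniform(0, 1, size=(trials, m))
g = samples.mean(axis=1)
empirical_tail = np.mean(g - 0.5 >= eps)          # E[g(S)] = 1/2
mcdiarmid_bound = np.exp(-2 * eps**2 / (1 / m))   # exp(-2 m eps^2)
print(f"empirical: {empirical_tail:.4f}  bound: {mcdiarmid_bound:.4f}")
```

The bound is loose for this example (it is a worst-case guarantee over all functions with these bounded differences), but it holds.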
Homework Problems
1. Graded. Prove that $\mathfrak{R}(A+b) = \mathfrak{R}(A)$, where $A+b = \{a+b : a \in A\}$, for any $b \in \mathbb{R}^m$.
2. Graded. Prove that $\mathfrak{R}(cA) = |c|\, \mathfrak{R}(A)$, where $cA = \{c \cdot a : a \in A\}$, for any $c \in \mathbb{R}$.
3. Graded. In lecture we stated the following one-sided uniform convergence generalization bound: for $\mathcal{F}$ containing functions $f:\mathcal{X} \to [0,1]$ and any $\delta>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, the following holds for all $f\in\mathcal{F}$:
$$L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log 1/\delta}{2m}}.$$
However, to show a bound on the estimation error of ERM we actually needed a two-sided bound, on $\sup_{f\in\mathcal{F}} \big| L(f)-\widehat{L}_S(f) \big|$. Use parts (1) and (2) to prove one. (You must use parts (1) and (2)).
Challenge, optional, 1 point extra credit. Let $S\sim \mathcal{D}^m$ and suppose $\mathcal{F}$ contains functions $f:\mathcal{X}\to [0,1]$. Prove the symmetrization lower bound, also called the desymmetrization inequality:
$$\frac{1}{2}\mathfrak{R}(\mathcal{F}) - \sqrt{\frac{\log 2}{2m}} \leq \mathbb{E}_{S} \left[ \sup_{f\in\mathcal{F}} \left| L(f) - \widehat{L}_S(f)\right|\right].$$