CS7545_Sp24_Lecture_10 - mltheory/CS7545 GitHub Wiki
CS 7545: Machine Learning Theory -- Spring 2024
Instructor: Tyler LaBonte
Notes for Lecture 10
February 15, 2024
Scribes: Meena Nagarajan, Caleb McFarland
Setup
$\mathcal{X}$ is an input space. $\mathcal{D}$ is a distribution on $\mathcal{X}$. Our sample is $S = \{x_1, \ldots, x_m\}$ drawn i.i.d. from $\mathcal{D}$. $\mathcal{F} \subseteq \{f: \mathcal{X} \rightarrow \mathbb{R}\}$ is the class of functions. $\sigma = (\sigma_1, \ldots, \sigma_m)$ is a vector of Rademacher random variables, so each $\sigma_i \in \{-1, 1\}$ is chosen independently with equal probability. We have the following definitions of the Empirical Rademacher Complexity and Rademacher Complexity, respectively:

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right], \qquad \mathfrak{R}(\mathcal{F}) = \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\widehat{\mathfrak{R}}_S(\mathcal{F})\right].$$
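As a quick numerical illustration (not part of the lecture), the empirical Rademacher complexity of a *finite* class can be estimated by Monte Carlo, representing each $f \in \mathcal{F}$ by its vector of values $(f(x_1), \ldots, f(x_m))$ on the sample. The class below is hypothetical.

```python
import numpy as np

def empirical_rademacher(F, num_trials=10_000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    F is a (k, m) array whose row j holds (f_j(x_1), ..., f_j(x_m)),
    i.e., the j-th function of a finite class evaluated on the sample S.
    Estimates E_sigma[ sup_f (1/m) * sum_i sigma_i * f(x_i) ].
    """
    rng = np.random.default_rng(rng)
    k, m = F.shape
    total = 0.0
    for _ in range(num_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
        total += np.max(F @ sigma) / m           # sup over the finite class
    return total / num_trials

# Hypothetical finite class: 5 functions evaluated on an m = 50 point sample.
F = np.random.default_rng(0).uniform(0.0, 1.0, size=(5, 50))
r = empirical_rademacher(F, rng=1)
print(f"estimated empirical Rademacher complexity: {r:.3f}")
```

Averaging this estimate over fresh draws of $S$ would approximate the (non-empirical) Rademacher complexity $\mathfrak{R}(\mathcal{F})$.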
We want to use the Rademacher Complexity to bound the uniform deviations $\underset{h \in \mathcal{H}}{\sup}|L(h) - \widehat{L}_S(h)|$ when $|\mathcal{H}| = \infty$. We will generalize the uniform deviations to functions $f: \mathcal{X} \rightarrow \mathbb{R}$ using the definitions below and get the desired bound as a corollary:

$$L(f) = \underset{x \sim \mathcal{D}}{\mathbb{E}}[f(x)], \qquad \widehat{L}_S(f) = \frac{1}{m}\sum_{i=1}^m f(x_i).$$
Here you can think of $L(f)$ as the true mean and $\widehat{L}_S(f)$ as the sample mean of $f$. This is in fact a generalization of $L(h)$: define $\mathcal{X}' = \mathcal{X} \times \{-1, 1\}$, so that you receive $z = (x, y)$ from $\mathcal{X}'$, and set $f(z) = \mathbb{1}(h(x) \neq y)$. In this framework we have that $L(f) = L(h)$. We will ignore the absolute value in the uniform deviations for now.
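A small sketch of the reduction above (with made-up data): wrapping a classifier $h$ into $f(z) = \mathbb{1}(h(x) \neq y)$ makes the sample mean of $f$ equal the empirical 0-1 loss of $h$.

```python
import numpy as np

# Hypothetical classifier h: sign threshold at 0.
h = lambda x: 1 if x >= 0 else -1

# f acts on z = (x, y) in X' = X x {-1, 1} and returns the 0-1 loss indicator.
f = lambda z: int(h(z[0]) != z[1])

# Made-up labeled sample drawn from X'.
S = [(-1.2, -1), (0.4, 1), (2.0, -1), (-0.3, 1), (1.1, 1)]

L_hat_f = np.mean([f(z) for z in S])                 # sample mean of f
L_hat_h = np.mean([int(h(x) != y) for (x, y) in S])  # empirical 0-1 loss of h
print(L_hat_f, L_hat_h)  # identical by construction
```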
Symmetrization Theorem
Suppose $\mathcal{F} \subseteq \{f: \mathcal{X} \rightarrow \mathbb{R}\}$. Then $\displaystyle\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}L(f) - \widehat{L}_S(f)\right] \leq 2\mathfrak{R}(\mathcal{F})$. Note that there is a close lower bound proved in the homework, so this shows that studying uniform deviations is essentially the same as studying Rademacher complexity.
Proof:
By definition, $\displaystyle L(f) = \underset{S \sim \mathcal{D}^m}{\mathbb{E}}[\widehat{L}_S(f)]$. Our first trick is to introduce a "ghost sample" $S' = \{x'_1, \ldots, x'_m\}$ drawn i.i.d. from $\mathcal{D}$ and write the following:

$$\underset{S}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, L(f) - \widehat{L}_S(f)\right] = \underset{S}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \underset{S'}{\mathbb{E}}\left[\widehat{L}_{S'}(f) - \widehat{L}_S(f)\right]\right] \leq \underset{S, S'}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \widehat{L}_{S'}(f) - \widehat{L}_S(f)\right]$$

because $S$ and $S'$ are independent (so $\widehat{L}_S(f)$ can be moved inside the expectation over $S'$), and the supremum of an expectation is at most the expectation of the supremum. Next, we introduce the Rademacher random variables:

$$= \underset{S, S', \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i \left(f(x'_i) - f(x_i)\right)\right].$$

Why is this true? If $\sigma_i = 1$ then the summand $f(x'_i) - f(x_i)$ stays the same. If $\sigma_i = -1$ then the summand becomes $f(x_i) - f(x'_i)$. We're taking the expectation over both $S$ and $S'$, so we are integrating over all draws of $S, S' \sim \mathcal{D}^m$. For fixed $\{z_1, \ldots, z_m\}$ and $\{z'_1, \ldots, z'_m\}$, it is equally likely that $S = \{z_1, \ldots, z_m\}$ while $S' = \{z'_1, \ldots, z'_m\}$ as it is that $S = \{z'_1, \ldots, z'_m\}$ and $S' = \{z_1, \ldots, z_m\}$. This means that the integral over all $S, S'$ is symmetric, and all $\sigma_i$ does is change the order of that summation. Subadditivity of the supremum states that the supremum of a sum is at most the sum of the suprema, so we proceed as follows:

$$\leq \underset{S', \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x'_i)\right] + \underset{S, \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m (-\sigma_i) f(x_i)\right]$$

$$= \underset{S, \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] + \underset{S, \sigma}{\mathbb{E}}\left[\underset{f \in \mathcal{F}}{\sup}\, \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right] = 2\mathfrak{R}(\mathcal{F})$$

because $S$ and $S'$ have the same distribution and $\sigma_i$ and $-\sigma_i$ have the same distribution. This completes the proof.
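As a sanity check on the theorem (a Monte Carlo sketch with a made-up finite class on a small finite domain, not part of the lecture), we can estimate both sides:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 8, 10, 25            # |F|, |X|, sample size (all made up)
F = rng.uniform(0, 1, (k, n))  # F[j, x] = f_j(x) on the domain X = {0, ..., n-1}
true_means = F.mean(axis=1)    # L(f) under the uniform distribution on X

trials = 2000
sup_dev, rad = 0.0, 0.0
for _ in range(trials):
    S = rng.integers(0, n, size=m)      # i.i.d. sample from uniform D
    emp_means = F[:, S].mean(axis=1)    # L_hat_S(f) for every f in F
    sup_dev += np.max(true_means - emp_means)
    sigma = rng.choice([-1.0, 1.0], size=m)
    rad += np.max(F[:, S] @ sigma) / m  # one draw of the Rademacher average
sup_dev /= trials
rad /= trials
print(f"E[sup L - L_hat] ~ {sup_dev:.3f}  <=  2 R(F) ~ {2 * rad:.3f}")
```

With these parameters the left side comes out well under $2\mathfrak{R}(\mathcal{F})$, consistent with the bound.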
Generalization Bound
Suppose $\mathcal{F} \subseteq \{f: \mathcal{X} \rightarrow [0, 1]\}$.
Then for any $\delta > 0$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, for all $f \in \mathcal{F}$, we have:

$$L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log 1/\delta}{2m}}.$$
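A quick numeric sketch (with hypothetical values for the empirical loss and Rademacher complexity) of how the bound $L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\log(1/\delta)/(2m)}$ tightens as $m$ grows:

```python
import math

def generalization_bound(emp_loss, rademacher, m, delta):
    """Upper bound on L(f): L_hat + 2 R(F) + sqrt(log(1/delta) / (2m))."""
    return emp_loss + 2 * rademacher + math.sqrt(math.log(1 / delta) / (2 * m))

# Hypothetical values: empirical loss 0.10, Rademacher complexity 0.05.
for m in (100, 1000, 10000):
    b = generalization_bound(0.10, 0.05, m, delta=0.05)
    print(f"m = {m:6d}: L(f) <= {b:.3f}")
```

Note that the confidence term decays like $1/\sqrt{m}$, while the $2\mathfrak{R}(\mathcal{F})$ term depends on the class (here it is held fixed for illustration).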
We introduce a new concentration bound called McDiarmid’s Inequality.
Suppose $g: \mathcal{X}^m \rightarrow \mathbb{R}$, i.e., $g(S)$ is a real number, and there exist $C_1, \ldots, C_m > 0$ such that for all $x_1, \ldots, x_m, x'_i \in \mathcal{X}$,

$$\left|g(x_1, \ldots, x_i, \ldots, x_m) - g(x_1, \ldots, x'_i, \ldots, x_m)\right| \leq C_i.$$

Then for all $\epsilon > 0$,

$$\underset{S \sim \mathcal{D}^m}{\mathbb{P}}\left[g(S) - \underset{S \sim \mathcal{D}^m}{\mathbb{E}}[g(S)] \geq \epsilon\right] \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^m C_i^2}\right).$$
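As an illustration (a sketch, not from the lecture): for $g(S)$ equal to the sample mean of $m$ variables in $[0, 1]$, each coordinate has bounded difference $C_i = 1/m$, so McDiarmid's Inequality gives $\mathbb{P}[g(S) - \mathbb{E}[g(S)] \geq \epsilon] \leq \exp(-2m\epsilon^2)$. We can compare this with the empirical tail:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, eps = 50, 20_000, 0.1

# g(S) = sample mean of m uniform [0, 1] variables; changing one
# coordinate moves g by at most C_i = 1/m, so sum_i C_i^2 = 1/m.
samples = rng.uniform(0, 1, size=(trials, m))
g = samples.mean(axis=1)
empirical_tail = np.mean(g - 0.5 >= eps)          # E[g(S)] = 1/2
mcdiarmid_bound = np.exp(-2 * eps**2 / (1 / m))   # exp(-2 m eps^2)
print(f"empirical: {empirical_tail:.4f}  bound: {mcdiarmid_bound:.4f}")
```

The bound is loose for this example (it is a worst-case guarantee over all functions with these bounded differences), but it holds.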
Homework Problems
1. Graded. Prove that $\mathfrak{R}(A+b) = \mathfrak{R}(A)$, where $A+b = \{a+b : a \in A\}$, for any $b \in \mathbb{R}^m$.
2. Graded. Prove that $\mathfrak{R}(cA) = |c|\, \mathfrak{R}(A)$, where $cA = \{c \cdot a : a \in A\}$, for any $c \in \mathbb{R}$.
3. Graded. In lecture we stated the following one-sided uniform convergence generalization bound: for $\mathcal{F}$ containing functions $f:\mathcal{X} \to [0,1]$ and any $\delta>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, the following holds for all $f\in\mathcal{F}$:
$$L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log 1/\delta}{2m}}.$$
However, to show a bound on the estimation error of ERM we actually needed a two-sided bound, on $\sup_{f\in\mathcal{F}} \big| L(f)-\widehat{L}_S(f) \big|$. Use parts (1) and (2) to prove one. (You must use parts (1) and (2)).
Challenge, optional, 1 point extra credit. Let $S\sim \mathcal{D}^m$ and suppose $\mathcal{F}$ contains functions $f:\mathcal{X}\to [0,1]$. Prove the symmetrization lower bound, also called the desymmetrization inequality:
$$\frac{1}{2}\mathfrak{R}(\mathcal{F}) - \sqrt{\frac{\log 2}{2m}} \leq \mathbb{E}_{S} \left[ \sup_{f\in\mathcal{F}} \left| L(f) - \widehat{L}_S(f)\right|\right].$$