CS7545_Sp24_Lecture_08 - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Tyler LaBonte

Notes for Lecture 08

February 1, 2024

Scribes: Nitin Vegesna, Yash Kankariya

Exam Details:

No calculator needed. Cheat sheet allowed with "normal" size paper front and back.


Recall that $X$ is the input space, $D$ is a distribution on $X \times \{-1, 1\}$, and $\mathbb{H}$ is the hypothesis class.

No-Free-Lunch Theorem Review

Ensures that for every $\mathbb{H}$, there exists a distribution on which the error is at least the bound stated in the previous lecture.

Generalization error: $$L(h) = \Pr_{(x,y) \sim D}[h(x) \neq y]$$

$$S = \{(x_i, y_i)\}_{i=1}^m \sim D^m$$ Remember that $S$ consists of i.i.d. samples.

Empirical error: $$\hat{L}_S(h) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}(h(x_i) \neq y_i)$$

ERM: $$\hat{h} = \operatorname{argmin}_{h \in \mathbb{H}} \hat{L}_S(h)$$

$$h^* = \operatorname{argmin}_{h \in \mathbb{H}} L(h)$$

$$L(\hat{h}) - L(h) = (L(\hat{h}) - L(h^*)) + (L(h^*) - L(h))$$

Estimation error: $(L(\hat{h}) - L(h^*))$

Approximation error: $(L(h^*) - L(h))$

Prop: $$L(\hat{h}) - L(h^*) \leq 2 \sup_{h \in \mathbb{H}} \left| L(h) - \hat{L}_S(h) \right|$$

We now want to bound the supremum, which depends on the sample size $m$. Generally, as $m$ grows, the deviation of the sample from the true distribution shrinks.
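As a quick illustration of this shrinking deviation, consider a toy sketch (all values here are hypothetical): a single fixed hypothesis whose true error is assumed to be $0.3$, whose average deviation $|L(h) - \hat{L}_S(h)|$ we estimate by Monte Carlo for several sample sizes.

```python
import random

random.seed(0)

# Hypothetical setup: a fixed hypothesis h with true error L(h) = 0.3,
# so each 0/1 loss is a Bernoulli(0.3) draw.
TRUE_ERR = 0.3

def empirical_error(m):
    """Draw m i.i.d. 0/1 losses with mean TRUE_ERR and average them."""
    return sum(1 if random.random() < TRUE_ERR else 0 for _ in range(m)) / m

def avg_deviation(m, trials=2000):
    """Estimate E[ |L(h) - L_hat_S(h)| ] over samples S of size m."""
    return sum(abs(empirical_error(m) - TRUE_ERR) for _ in range(trials)) / trials

for m in [10, 100, 1000]:
    print(m, round(avg_deviation(m), 4))
```

The printed deviations shrink roughly like $1/\sqrt{m}$, consistent with the rate in the bound below.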

Important: The following two statements are worlds apart.

  1. There exists $m$ such that $\forall h \in \mathbb{H}$, with prob $\geq 1 - \delta$, $\left| L(h) - \hat{L}_S(h) \right| \leq \epsilon$

  2. There exists $m$ such that with prob $\geq 1 - \delta$, $\forall h \in \mathbb{H}$, $\left| L(h) - \hat{L}_S(h) \right| \leq \epsilon$

Statement 2 is the one we need. We are dealing with a supremum, so we must ensure that no hypothesis deviates by more than the bound, since the supremum depends only on the hypothesis with the greatest deviation.

The first statement looks like CLT/Hoeffding's. Second is more sophisticated, and it holds simultaneously (uniformly) over $\mathbb{H}$. This is what we need for ERM learning.

  • $\sup_{h \in \mathbb{H}} \left| L(h) - \hat{L}_S(h) \right|$ is called the "uniform deviation"
  • the second statement is called "uniform convergence"

"Uniform Convergence Generalization Bound"

Prop: Let $\mathbb{H}$ be a set of functions mapping $X$ to $\{-1, 1\}$, such that $|\mathbb{H}| < \infty$.

Then, for any $\delta > 0$, with prob $\geq 1 - \delta$ over $S \sim D^m$, the following holds for all $h \in \mathbb{H}$.

$$ L(h) \leq \hat{L}_S(h) + \sqrt{\frac{\log\left| \mathbb{H} \right| + \log\left(\frac{1}{\delta}\right)}{2m}} $$

Key term: $\sqrt{\log\left| \mathbb{H} \right|/m}$
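Plugging numbers into the bound gives a feel for its scale (a minimal sketch; the values of $|\mathbb{H}|$, $m$, and $\delta$ below are arbitrary):

```python
import math

def uc_bound(m, H_size, delta):
    """Additive term in the uniform convergence bound:
    sqrt((log|H| + log(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

# E.g. with |H| = 1000 hypotheses, m = 10000 samples, delta = 0.05,
# every h in H satisfies L(h) <= L_hat_S(h) + uc_bound(m, |H|, delta).
print(round(uc_bound(10000, 1000, 0.05), 4))
```

Note that $|\mathbb{H}|$ enters only through a logarithm, so even large finite classes add little to the bound compared with shrinking $m$.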

Big research question: how to deal with infinite hypothesis classes.


We want to find $\epsilon$ such that $$Pr[\forall h \in H, L(h) - \hat{L}_S(h) \leq \epsilon] \geq 1 - \delta$$

Equivalent to $$Pr[\exists h \in H, L(h) - \hat{L}_S(h) \geq \epsilon] \leq \delta$$

LHS can be written as: $Pr[\cup_{h \in \mathbb{H}} (L(h) - \hat{L}_S(h) \geq \epsilon)]$

Union bound: $$\Pr[\cup_i A_i] \leq \sum_i \Pr[A_i]$$

$$\textbf{** } Pr[\cup_{h \in \mathbb{H}} (L(h) - \hat{L}_S(h) \geq \epsilon)] \leq \sum_h Pr[(L(h) - \hat{L}_S(h) \geq \epsilon)]$$

Recall Hoeffding's inequality: if $z_1, ..., z_m$ are independent random variables with $z_i \in [a_i, b_i]$, then $$\Pr\left[\mathbb{E}\left[\frac{1}{m}\sum_{i=1}^m z_i\right] - \frac{1}{m} \sum_{i=1}^m z_i \geq \epsilon\right] \leq \exp\left(\frac{-2m^2\epsilon^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$

For us, $z_i = \mathbb{1}(h(x_i) \neq y_i)$ and $[a_i, b_i] = [0, 1]$ for all $i$.

By Hoeffding's: $$\Pr[L(h) - \hat{L}_S(h) \geq \epsilon] \leq \exp\left(\frac{-2m^2\epsilon^2}{\sum_{i=1}^m (1 - 0)^2}\right) = \exp(-2m\epsilon^2)$$

Returning to equation (**): $$\Pr[\cup_{h \in \mathbb{H}} (L(h) - \hat{L}_S(h) \geq \epsilon)] \leq \sum_h \exp(-2m\epsilon^2) = \left| H \right| \exp(-2m\epsilon^2)$$

We want $\left| H \right| \exp(-2m\epsilon^2) \leq \delta$; solving for $\epsilon$:

$$-2m\epsilon^2 \leq \log(\delta / \left| H \right|)$$

$$\epsilon \geq \sqrt{\frac{\log(\left| H \right| / \delta)}{2m}} = \sqrt{\frac{\log\left| H \right| + \log(1/\delta)}{2m}}$$
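A Monte Carlo sanity check of the union-bound step (a sketch under a simplifying assumption: each hypothesis's losses are drawn independently, which the union bound does not require, so the guarantee still applies):

```python
import math
import random

random.seed(1)

def max_deviation(errs, m):
    """One draw: largest one-sided deviation L(h) - L_hat_S(h) over a
    hypothetical class whose true errors are listed in errs."""
    worst = 0.0
    for p in errs:
        emp = sum(random.random() < p for _ in range(m)) / m
        worst = max(worst, p - emp)  # one-sided, as in the proof
    return worst

k, m, eps, trials = 20, 200, 0.1, 500
errs = [random.random() * 0.5 for _ in range(k)]  # arbitrary true errors

exceed = sum(max_deviation(errs, m) >= eps for _ in range(trials)) / trials
bound = k * math.exp(-2 * m * eps**2)  # |H| * exp(-2 m eps^2)
print(exceed, round(bound, 4))
```

The observed frequency of a deviation exceeding $\epsilon$ stays below $|H|e^{-2m\epsilon^2}$, as the derivation predicts.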

Complexity of $\mathbb{H}$

The complexity of $\mathbb{H}$ correlates with the size of $\mathbb{H}$, but what is the complexity of an infinite class? We can't use size, so we need another measure.

Generalize $\mathbb{H}$ to $\mathbb{F}$ which is the set of functions $f$ mapping $X \to \mathbb{R}$.

Rademacher random variable:

$$\sigma_i = \begin{cases} +1 & \text{with prob } 1/2 \\ -1 & \text{otherwise } \end{cases}$$

Rademacher random vector:

$$\sigma = (\sigma_1, ..., \sigma_m) \text{ independent}$$

Consider $D = D_{|X}$, the marginal distribution on $X$.

Def. Empirical Rademacher complexity of $\mathbb{F}$ with respect to $S = \{x_1, ..., x_m\}$:

$$\hat{R}_S(\mathbb{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathbb{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i)\right]$$

The Rademacher complexity with respect to $D$ is:

$$R(\mathbb{F}) = \mathbb{E}_{S \sim D^m}[\hat{R}_S(\mathbb{F})]$$

$R(\mathbb{F})$ is a real number; the higher it is, the more complex $\mathbb{F}$ is.

Interpretation 1: $\sigma$ partitions $S$

$$\hat{R}_S(\mathbb{F}) = \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{f \in \mathbb{F}}\left( \sum_{i: \sigma_i = 1} f(x_i) - \sum_{j: \sigma_j = -1} f(x_j)\right)\right]$$

Think of it as measuring how different the total function values can be with a random partition of $S$.

Interpretation 2: Write $\vec{f} = (f(x_1), ..., f(x_m))$

$$\hat{R}_S(\mathbb{F}) = \frac{1}{m} \mathbb{E}_{\sigma}\left[\sup_{f \in \mathbb{F}} \langle\sigma, \vec{f}\rangle\right]$$

Think of it as measuring how well $\mathbb{F}$ can correlate with Rademacher random noise.
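The definition can be estimated directly by Monte Carlo for a small finite class (a hypothetical sketch: the class is represented by its vectors of function values $\vec{f}$ on $S$, here drawn at random):

```python
import random

random.seed(0)

m, k = 50, 8
# Hypothetical finite class: k fixed vectors of function values on S,
# i.e. f_vecs[j][i] = f_j(x_i) in {-1, +1}.
f_vecs = [[random.choice([-1, 1]) for _ in range(m)] for _ in range(k)]

def empirical_rademacher(f_vecs, m, trials=5000):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) <sigma, f_vec> ]."""
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice([-1, 1]) for _ in range(m)]
        total += max(sum(s * fi for s, fi in zip(sigma, f)) for f in f_vecs) / m
    return total / trials

print(round(empirical_rademacher(f_vecs, m), 3))
```

Interpretation 2 is visible in the code: each trial asks how well some $\vec{f}$ in the class can correlate with the random sign vector $\sigma$.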


In lecture we showed the following one-sided uniform convergence generalization bound: for $\mathcal{H}\subseteq \{h:\mathcal{X}\to\{-1,1\}\}$ such that $|\mathcal{H}|<\infty$ and any $\delta>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, the following holds for all $h\in\mathcal{H}$: $$L(h)\leq \widehat{L}_S(h)+\sqrt{\frac{\log|\mathcal{H}|+\log 1/\delta}{2m}}.$$

This bound shows that the estimation error of the empirical risk minimizer goes to zero with $1/\sqrt{m}$, which is often called the "slow rate". Interestingly, we did not use any properties of $\mathcal{H}$ besides its size. One may wonder whether there is any advantage to choosing a hypothesis class $\mathcal{H}'$ which is the same size as $\mathcal{H}$ but contains "better" functions. In fact, it turns out that if all the functions in $\mathcal{H}'$ have sufficiently low generalization error, we can achieve a "fast rate" of $1/m$.

In order to prove the fast rate bound, we need a more sophisticated concentration bound than Hoeffding's inequality. Let us state Bernstein's inequality: Let $Z_1,\dots,Z_m$ be i.i.d. random variables with zero mean such that $|Z_i|\leq C$ and $\textnormal{Var}(Z_i)\leq D$ for all $i$. Then for all $\epsilon>0$, $$\Pr\left[\frac{1}{m}\sum_{i=1}^m Z_i \geq \epsilon \right] \leq \exp\left(\frac{-(m\epsilon^2)/2}{D+(C\epsilon)/3}\right).$$
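To see why Bernstein helps when the variance is small, compare the two tail bounds numerically (a sketch; the parameter values below are arbitrary, and the losses are centered so the $Z_i$'s have zero mean):

```python
import math

def hoeffding_tail(m, eps):
    """Hoeffding bound for variables in [0, 1]: exp(-2 m eps^2)."""
    return math.exp(-2 * m * eps**2)

def bernstein_tail(m, eps, C, D):
    """Bernstein bound as stated above: |Z_i| <= C, Var(Z_i) <= D."""
    return math.exp(-(m * eps**2) / 2 / (D + C * eps / 3))

# Centered 0/1 loss with small true error q: |Z_i| <= 1, Var(Z_i) <= q.
m, eps, q = 1000, 0.05, 0.01
print(hoeffding_tail(m, eps), bernstein_tail(m, eps, C=1.0, D=q))
```

When the variance bound $D$ is much smaller than the range $C$, Bernstein's exponent is far larger in magnitude than Hoeffding's, which is exactly the regime exploited by the fast-rate argument in the problems below.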

  1. Graded. Let $\mathcal{H}\subseteq \{h:\mathcal{X}\to\{-1,1\}\}$ such that $|\mathcal{H}|<\infty$. Suppose there exists a function $q:\mathbb{R}\to\mathbb{R}$ such that $L(h)\leq q(m)$ for any $h\in\mathcal{H}$ and $S\subseteq \mathcal{X}$ with $|S|=m.$ Use Bernstein's inequality to prove that for any $\delta>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, the following holds for all $h\in\mathcal{H}$: $$L(h)\leq \widehat{L}_S(h)+\sqrt{\frac{2(\log|\mathcal{H}|+\log 1/\delta)q(m)}{m}}+\frac{2(\log|\mathcal{H}|+\log 1/\delta)}{3m}.$$ Hint. First, define the $Z_i$'s and compute $C$ and $D$. Then, the rest will be similar to the proof of our original bound, but with Bernstein instead of Hoeffding. It's OK if you get different constants than are listed here.

  2. Graded. How should $q(m)$ scale in order to obtain the fast rate? It is sufficient to give an answer like $q(m)=\mathcal{O}(?)$ and explain your reasoning.

  3. Ungraded, optional. A more general form of Bernstein's inequality holds for a very large class of distributions called subexponential distributions. These distributions are roughly characterized by having heavier tails than a Gaussian -- they decay with $e^{-x}$ instead of $e^{-x^2}$ -- and they come up often in machine learning theory. If you would like to learn more, read sections 2.7 and 2.8 of High-Dimensional Probability.
