We now want to bound the supremum, which depends on the sample size $m$. Generally, as $m$ gets larger, the deviation of the sample from the true distribution gets smaller.
Important: The following two statements are worlds apart.
1. There exists $m$ such that for all $h \in \mathbb{H}$: with prob $\geq 1 - \delta$, $\left| L(h) - \hat{L}_S(h) \right| \leq \epsilon$.
2. There exists $m$ such that with prob $\geq 1 - \delta$: for all $h \in \mathbb{H}$, $\left| L(h) - \hat{L}_S(h) \right| \leq \epsilon$.
Statement 2 is what we need. We are dealing with a supremum, so we must ensure that no hypothesis deviates by more than the bound, since the supremum is determined by the hypothesis with the greatest deviation.
The first statement looks like the CLT/Hoeffding's inequality. The second is more sophisticated: it holds simultaneously (uniformly) over $\mathbb{H}$. This is what we need for ERM learning.
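To see the gap between the two statements, here is a small simulation (not from the lecture; the setup is an illustrative assumption). Each "hypothesis" has true loss $0.5$, and its empirical loss is an average of $m$ coin flips. Any single fixed hypothesis deviates little, but the supremum over many hypotheses deviates far more:

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, trials = 200, 1000, 500  # sample size, |H|, Monte Carlo trials

# Each hypothesis has true loss 0.5; its empirical loss is a mean of m fair
# coin flips. losses[t, k] = empirical loss of hypothesis k in trial t.
losses = rng.binomial(m, 0.5, size=(trials, K)) / m
dev = np.abs(losses - 0.5)

single = dev[:, 0].mean()         # typical deviation of one fixed hypothesis
uniform = dev.max(axis=1).mean()  # typical sup of deviations over all K

print(f"avg |L - L_hat| for one fixed h: {single:.3f}")
print(f"avg sup_h |L - L_hat| over {K} h: {uniform:.3f}")
```

The uniform deviation is several times larger than the pointwise one, which is exactly why the bound must pay a $\log |\mathbb{H}|$ price.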
The quantity $\sup_{h \in \mathbb{H}} \left| L(h) - \hat{L}_S(h) \right|$ is called the "uniform deviation".
The second statement is called "uniform convergence".
"Uniform Convergence Generalization Bound"
Prop: Let $\mathbb{H}$ be a set of functions mapping $\mathcal{X}$ to $\{-1, 1\}$, such that $|\mathbb{H}| < \infty$.
Then, for any $\delta > 0$, with prob $\geq 1 - \delta$ over $S \sim D^m$, the following holds for all $h \in \mathbb{H}$:
$$L(h) \leq \hat{L}_S(h) + \sqrt{\frac{\log |\mathbb{H}| + \log(1/\delta)}{2m}}.$$
Returning to equation (**), by the union bound and Hoeffding's inequality: $$\Pr\left[\bigcup_{h \in \mathbb{H}} \left\{ L(h) - \hat{L}_S(h) \geq \epsilon \right\}\right] \leq \sum_{h \in \mathbb{H}} \exp(-2m\epsilon^2) = \left| \mathbb{H} \right| \exp(-2m\epsilon^2)$$
$$\text{Want } \left| \mathbb{H} \right| \exp(-2m\epsilon^2) \leq \delta$$
$$\text{Solve for } \epsilon:$$
$$-2m\epsilon^2 \leq \log(\delta / \left| \mathbb{H} \right|)$$
$$\epsilon \geq \sqrt{-\frac{\log(\delta / \left| \mathbb{H} \right|)}{2m}} = \sqrt{\frac{\log(\left| \mathbb{H} \right|) + \log(1/\delta)}{2m}}$$
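The bound is easy to evaluate numerically. A minimal sketch (the function names are mine, not from the notes): one helper computes the deviation $\epsilon$ for given $|\mathbb{H}|$, $\delta$, $m$, and another inverts it to give the sample size this bound requires for a target $\epsilon$.

```python
import math

def uc_epsilon(H_size: int, delta: float, m: int) -> float:
    """Uniform convergence deviation: sqrt((log|H| + log(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

def uc_sample_size(H_size: int, delta: float, eps: float) -> int:
    """Smallest m (according to this bound) so that, with prob >= 1 - delta,
    every h in H has |L(h) - L_hat(h)| <= eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps**2))

print(uc_epsilon(1000, 0.05, 10_000))    # ~0.022
print(uc_sample_size(1000, 0.05, 0.01))  # ~5 * 10^4 samples
```

Note the $1/\sqrt{m}$ dependence: halving $\epsilon$ requires four times as many samples, while doubling $|\mathbb{H}|$ costs only an additive $\log 2$ in the numerator.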
Complexity of $\mathbb{H}$
Complexity of $\mathbb{H}$ correlates with the size of $\mathbb{H}$, but what is the complexity of an infinite class? We can't use size, so we need another way to measure it.
Generalize $\mathbb{H}$ to $\mathbb{F}$, which is a set of functions $f$ mapping $\mathcal{X} \to \mathbb{R}$.
Think of it as measuring how well $\mathbb{F}$ can correlate with Rademacher random noise.
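The lecture doesn't write out the formula here, but the standard formalization of "correlation with Rademacher noise" is the empirical Rademacher complexity $\hat{\mathfrak{R}}_S(\mathbb{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathbb{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]$, where the $\sigma_i$ are independent uniform $\pm 1$ variables. A Monte Carlo sketch (the class of random sign functions is an illustrative assumption, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

def rademacher_complexity(F_values: np.ndarray, n_sigma: int = 2000) -> float:
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(x_i) ],
    where F_values[k, i] = f_k(x_i) for each function f_k on the sample."""
    K, m = F_values.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_sigma, m))  # Rademacher noise
    # correlations[t, k] = (1/m) * sum_i sigmas[t, i] * f_k(x_i)
    correlations = sigmas @ F_values.T / m
    return correlations.max(axis=1).mean()

m = 100
# A single fixed function cannot correlate with fresh noise on average:
r_single = rademacher_complexity(rng.choice([-1.0, 1.0], size=(1, m)))
# A large class of sign functions can chase the noise much better:
r_class = rademacher_complexity(rng.choice([-1.0, 1.0], size=(500, m)))
print(r_single)  # close to 0
print(r_class)   # noticeably larger
```

A richer class achieves a larger supremum against the same noise, which is exactly the sense in which Rademacher complexity measures richness without counting $|\mathbb{F}|$.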
Homework
In lecture we showed the following one-sided uniform convergence generalization bound: for $\mathcal{H}\subseteq \{h:\mathcal{X}\to\{-1,1\}\}$ such that $|\mathcal{H}|<\infty$ and any $\delta>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, the following holds for all $h\in\mathcal{H}$:
$$L(h)\leq \widehat{L}_S(h)+\sqrt{\frac{\log|\mathcal{H}|+\log 1/\delta}{2m}}.$$
This bound shows that the estimation error of the empirical risk minimizer goes to zero with $1/\sqrt{m}$, which is often called the "slow rate". Interestingly, we did not use any properties of $\mathcal{H}$ besides its size. One may wonder whether there is any advantage to choosing a hypothesis class $\mathcal{H}'$ which is the same size as $\mathcal{H}$ but contains "better" functions. In fact, it turns out that if all the functions in $\mathcal{H}'$ have sufficiently low generalization error, we can achieve a "fast rate" of $1/m$.
In order to prove the fast rate bound, we need a more sophisticated concentration bound than Hoeffding's inequality. Let us state Bernstein's inequality: Let $Z_1,\dots,Z_m$ be i.i.d. random variables with zero mean such that $|Z_i|\leq C$ and $\textnormal{Var}(Z_i)\leq D$ for all $i$. Then for all $\epsilon>0$,
$$\Pr\left[\frac{1}{m}\sum_{i=1}^m Z_i \geq \epsilon \right] \leq \exp\left(\frac{-(m\epsilon^2)/2}{D+(C\epsilon)/3}\right).$$
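To get a feel for why Bernstein's inequality enables faster rates, we can simply evaluate both tail bounds numerically (a sketch; the parameter values are arbitrary). When the variance bound $D$ is small, the Bernstein tail is dramatically tighter than the $\exp(-2m\epsilon^2)$ Hoeffding tail used above for $[0,1]$-valued losses:

```python
import math

def hoeffding_tail(m: int, eps: float) -> float:
    """Hoeffding tail bound exp(-2 m eps^2), as used for [0,1]-valued losses."""
    return math.exp(-2 * m * eps**2)

def bernstein_tail(m: int, eps: float, C: float, D: float) -> float:
    """Bernstein tail bound exp(-(m eps^2 / 2) / (D + C eps / 3))
    for zero-mean variables with |Z_i| <= C and Var(Z_i) <= D."""
    return math.exp(-(m * eps**2 / 2) / (D + C * eps / 3))

m, eps = 1000, 0.05
print(hoeffding_tail(m, eps))               # exp(-5), about 6.7e-3
print(bernstein_tail(m, eps, C=1, D=0.01))  # far smaller when D is small
```

Intuitively, Hoeffding only uses boundedness, while Bernstein exploits a small variance; this is the lever the homework asks you to pull via the assumption $L(h) \leq q(m)$.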
Graded. Let $\mathcal{H}\subseteq \{h:\mathcal{X}\to\{-1,1\}\}$ such that $|\mathcal{H}|<\infty$. Suppose there exists a function $q:\mathbb{R}\to\mathbb{R}$ such that $L(h)\leq q(m)$ for any $h\in\mathcal{H}$ and $S\subseteq \mathcal{X}$ with $|S|=m.$ Use Bernstein's inequality to prove that for any $\delta>0$, with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, the following holds for all $h\in\mathcal{H}$:
$$L(h)\leq \widehat{L}_S(h)+\sqrt{\frac{2(\log|\mathcal{H}|+\log 1/\delta)q(m)}{m}}+\frac{2(\log|\mathcal{H}|+\log 1/\delta)}{3m}.$$Hint. First, define the $Z_i$'s and compute $C$ and $D$. Then, the rest will be similar to the proof of our original bound, but with Bernstein instead of Hoeffding. It's OK if you get different constants than are listed here.
Graded. How should $q(m)$ scale in order to obtain the fast rate? It is sufficient to give an answer like $q(m)=\mathcal{O}(?)$ and explain your reasoning.
Ungraded, optional. A more general form of Bernstein's inequality holds for a very large class of distributions called subexponential distributions. These distributions are roughly characterized by having heavier tails than a Gaussian -- they decay with $e^{-x}$ instead of $e^{-x^2}$ -- and they come up often in machine learning theory. If you would like to learn more, read sections 2.7 and 2.8 of High-Dimensional Probability.