CS7545_Sp23_Lecture_07: Statistical Learning and ERM - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler Labonte

Notes for Lecture 07

January 31, 2023

Scribes: Hassan, Bar Alluf

Statistical Learning Theory (Continued)

Let's revisit a few definitions:

  • $\mathcal{X}$ is an input space and $D$ is a distribution on $\mathcal{X} \times \{-1,1\}$
  • $H$ is our hypothesis class, a set of functions $h:\mathcal{X} \rightarrow \{-1,1\}$
  • $L(h)$ is our generalization error/risk. It can be represented as the probability that we misclassify a point drawn from $D$. That is, $L(h) = P_{(x,y) \sim D}[h(x) \neq y]$
  • Our empirical error of a hypothesis $h$ on a training set $S= \{ (x_{i},y_{i})\}_{i=1}^{m} \sim D^m$ is given by $$\hat{L}_s(h)= \frac{1}{m} \sum _{i=1}^m \mathbf{1} (h(x_i)\neq y_i)$$ where $m$ is the number of data points (see the short sketch below)
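
To make the empirical error concrete, here is a minimal NumPy sketch; the threshold hypothesis `h`, the synthetic sample, and all constants are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hypothesis: a simple threshold classifier on 1-D inputs.
def h(x):
    return np.where(x > 0.5, 1, -1)

# Synthetic sample S = {(x_i, y_i)}_{i=1}^m drawn i.i.d. from some toy D.
m = 1000
x = rng.uniform(0, 1, size=m)
y = np.where(x + 0.1 * rng.standard_normal(m) > 0.5, 1, -1)  # noisy labels

# Empirical error: average of the 0-1 loss over the sample.
L_hat = np.mean(h(x) != y)
print(f"empirical error L_hat_S(h) = {L_hat:.3f}")
```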

Given this, our ultimate goal is to get close to the Bayes error, the smallest risk achievable by any classifier: $$L_{bayes} = \min_h L(h)$$ If $h^{*}$ is our best-in-class hypothesis, it minimizes the generalization error over $H$: $$L(h^{*}) =\min_ {h\in H} L(h)$$

Error Decomposition

We can decompose our errors as: $$L(h)- L_{bayes} =(L(h) - L(h^{*}))+(L(h^{*})- L_{bayes})$$ where the term $(L(h) - L(h^{*}))$ is our estimation error and $(L(h^{*})- L_{bayes})$ is our approximation error. For more details on this, visit the No Free Lunch Theorem section in Lecture 6.

Empirical Risk Minimization (ERM)

ERM is employed to select a hypothesis. We want to choose an $\hat{h}$ that lies among the minimizers of the empirical error over $H$: $$\hat{h} \in \arg\min_{h\in H} \hat{L}_s(h)$$ However, with the possibility of infinitely many hypotheses in $H$, this problem might be NP-hard to compute. We therefore limit the cardinality of the search space $H$, but this introduces bias, as shown in the sketch below.
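
A minimal sketch of ERM over a small finite class, assuming a toy setting of 1-D threshold classifiers on a grid and a synthetic sample (all illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite hypothesis class H: threshold classifiers h_t(x) = sign(x - t)
# for a grid of thresholds t. Restricting to this grid is the inductive bias.
thresholds = np.linspace(0.0, 1.0, 21)

def predict(t, x):
    return np.where(x >= t, 1, -1)

# Training sample S ~ D^m (synthetic, for illustration).
m = 500
x = rng.uniform(0, 1, size=m)
y = np.where(x > 0.3, 1, -1)

# ERM: pick any h_t minimizing the empirical error on S.
emp_errors = np.array([np.mean(predict(t, x) != y) for t in thresholds])
t_hat = thresholds[np.argmin(emp_errors)]
print(f"ERM threshold t_hat = {t_hat:.2f}, empirical error = {emp_errors.min():.3f}")
```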

Note on Inductive Bias

We normally restrict the search space of the ERM learner to a restricted (for example, finite) hypothesis class $H$. By doing so, we introduce a source of bias, and such restrictions are often called inductive bias. However, this also helps tackle overfitting. To quote from the textbook: "Intuitively, choosing a more restricted hypothesis class better protects us against overfitting but at the same time might cause us a stronger inductive bias." (Chapter 2.3, page 36)

Proposition 7.1

Prop Let $\hat h$ be an ERM hypothesis on $S$. Then, $$L(\hat{h})- L(h^{*}) \leq 2 \sup_{h\in H}|L(h)-\hat{L}_s(h)|$$ Here $L(\hat{h})$ is the generalization error of the empirical risk minimizer trained on $S$, whereas $L(h^{*})$ is the generalization error of our best-in-class hypothesis. This proposition states that the difference between the two is bounded by a uniform property of the class: the worst-case gap between generalization and empirical error.

Proof $$L(\hat h) - L(h^{*}) \le L(\hat{h}) - L(h^{*}) + \hat L_s(h^{*}) - \hat{L}_s(\hat h)$$

Note that the sum of the last two terms is nonnegative because $\hat h$ minimizes $\hat L_s$ over $H$, so $\hat L_s(\hat h) \le \hat L_s(h^{*})$.

Regrouping the remaining terms and bounding each difference by its absolute value,

$$\leq |L(\hat{h}) - \hat{L}_s(\hat{h})| + | L(h^{*}) - \hat{L}_s(h^*)|$$

Each of these two terms involves a single hypothesis in $H$, so neither can exceed the supremum over the class: $$\leq 2 \sup_{h\in H} |L(h) - \hat{L}_s(h)|$$

The RHS is small if the empirical errors $\hat L_s(h)$ are close to the generalization errors $L(h)$ uniformly over all $h\in H$.
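
A quick Monte Carlo illustration of Proposition 7.1, assuming the same toy threshold class as above: with $x \sim \mathrm{Uniform}[0,1]$ and $y = \mathrm{sign}(x - 0.3)$, the true risk of $h_t$ is $|t - 0.3|$, so both sides of the inequality can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: x ~ Uniform[0,1], y = sign(x - 0.3), H = threshold classifiers
# on a grid. In this setting L(h_t) = |t - 0.3| exactly.
thresholds = np.linspace(0.0, 1.0, 21)
true_risk = np.abs(thresholds - 0.3)

m = 200
x = rng.uniform(0, 1, size=m)
y = np.where(x > 0.3, 1, -1)
emp_risk = np.array([np.mean(np.where(x >= t, 1, -1) != y) for t in thresholds])

# Left-hand side: excess risk of the ERM hypothesis over the best in class.
lhs = true_risk[np.argmin(emp_risk)] - true_risk.min()
# Right-hand side: twice the worst-case deviation over the class.
rhs = 2 * np.max(np.abs(true_risk - emp_risk))
print(f"L(h_hat) - L(h*) = {lhs:.3f} <= {rhs:.3f} = 2 sup |L - L_hat|")
```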

Proposition 7.2

Before moving on to the next proposition, notice how the following two statements differ:

  1. There exists an $m$ such that for every $h\in H$, with probability $1-\delta$, $|L(h) - \hat{L}_s(h)| \le \varepsilon$
  2. There exists an $m$ such that with probability $1-\delta$, for every $h\in H$, $|L(h) - \hat{L}_s(h)| \le \varepsilon$

Note: $|S| = m$ is the size of the data set.

The first statement follows from a standard concentration argument for a single fixed hypothesis (in the spirit of the law of large numbers; Hoeffding's inequality below makes it quantitative). It captures the idea that "when the dataset size is large enough, the empirical mean and the true mean are close".

The second statement is more sophisticated. In particular, it depends on the "complexity" of $H$. It is an instance of uniform convergence, sometimes called a uniform deviation bound. The first statement in fact follows from the second, but proving the first directly yields a smaller bound on $m$.

A uniform convergence generalization bound is a one-sided bound on $L(h) - \hat L_s(h)$ that holds simultaneously for all $h \in H$.

Prop: Let $H$ be a finite hypothesis class. Then for any $\delta> 0$, with probability at least $1-\delta$ over a sample $S \sim D^m$, the following holds for all $h\in H$:

$$L(h) \le \hat L_s(h) + \sqrt{ \frac{\ln(|H|)+\ln(\delta^{-1})}{2m} }$$

Note: the key quantity here is $\ln(|H|) / m$: the deviation term grows as the size of the class $|H|$ increases, and shrinks as we get more data.
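
A small helper that evaluates the right-hand side of this bound; the numbers in the example call are made up for illustration.

```python
import math

def finite_class_bound(emp_error, H_size, m, delta):
    """Right-hand side of Proposition 7.2:
    L(h) <= L_hat_S(h) + sqrt((ln|H| + ln(1/delta)) / (2m)),
    holding simultaneously for all h in H with probability >= 1 - delta."""
    return emp_error + math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

# Example: |H| = 1000 hypotheses, m = 10000 samples, 95% confidence.
print(finite_class_bound(emp_error=0.10, H_size=1000, m=10_000, delta=0.05))
# The deviation term grows like sqrt(ln|H| / m): logarithmic in the class
# size, shrinking as 1/sqrt(m) in the sample size.
```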

Proof using Hoeffding

Fix $h \in H$ and apply Hoeffding's inequality to the random variables $z_i = \mathbf{1} (h(x_i) \ne y_i)$, which are i.i.d. and bounded in $[0,1]$ for all $i$:

$$P\bigg[E \bigg[\sum_{i=1}^m z_i\bigg] - \sum_{i=1}^m z_i \ge \epsilon' \bigg]\le \exp\left( - \frac{2\epsilon'^2}{m} \right)$$

We want to quantify the failure probability. To connect this result to the empirical mean, we divide the deviation by $m$ and define a new $\varepsilon = \frac{\epsilon'}{m}$, which leads to

$$P[L(h) - \hat L_s(h) \ge \varepsilon] = P\bigg[E \bigg[\frac{1}{m}\sum_{i=1}^m z_i\bigg] - \frac{1}{m}\sum_{i=1}^m z_i \ge \frac{1}{m}\varepsilon m \bigg]\le \exp\left( - \frac{2(\varepsilon m)^2}{m} \right) = \exp\left( -2 m \varepsilon^2 \right)$$

Notice how the $m$ moves from the denominator to the numerator of the exponent when we divide by $m$. This is natural: deviations of a sum can grow as we add more terms, but deviations of an average shrink.
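
A quick Monte Carlo sanity check of this one-sided tail bound for a single fixed $h$, modeling the 0-1 losses as i.i.d. Bernoulli draws (the parameter $p$ and the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirically check the one-sided Hoeffding bound for a single fixed h:
# P[L(h) - L_hat_S(h) >= eps] <= exp(-2 m eps^2).
# Model the losses z_i = 1(h(x_i) != y_i) as i.i.d. Bernoulli(p), p = L(h).
p, m, eps, trials = 0.3, 200, 0.05, 100_000

z = rng.random((trials, m)) < p          # trials x m matrix of 0-1 losses
emp_means = z.mean(axis=1)               # L_hat_S(h) in each trial
failure_freq = np.mean(p - emp_means >= eps)

print(f"observed failure frequency: {failure_freq:.4f}")
print(f"Hoeffding bound exp(-2*m*eps^2): {np.exp(-2 * m * eps**2):.4f}")
```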

To bound the probability of failure over multiple hypotheses simultaneously, we use the union bound, a simple statement:

$$P[\bigcup_{i=1}^n X_i] \leq \sum_{i=1}^n P[X_i]$$

Our probability of failure now becomes:

$$\begin{aligned} P[\exists h \in H : L(h) - \hat L_s(h) \ge \varepsilon] &= P\bigg[\bigcup_{h\in H}\big\{L(h) - \hat L_{s}(h) \ge \varepsilon\big\}\bigg] \\ &\le \sum_{h \in H} P[L(h) - \hat L_{s}(h) \ge \varepsilon] \\ &\le \sum_{h \in H}\exp(-2m\varepsilon^2) = |H| \exp(-2m\varepsilon^2) \end{aligned}$$

We want this failure probability to be at most $\delta$, as mentioned above. Setting $|H| \exp(-2m\varepsilon^2) = \delta$ and solving for $\varepsilon$ gives

$$\varepsilon = \sqrt{\frac{\ln(|H|/\delta)}{2m}} = \sqrt{\frac{\ln(|H|)+\ln(\delta^{-1})}{2m}},$$ which is exactly the deviation term in Proposition 7.2, completing the proof.
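
As a sanity check of the full argument, the sketch below plugs this $\varepsilon$ into a toy model where each hypothesis has an independent Bernoulli 0-1 loss (an assumption made only for illustration; the union bound does not require independence) and confirms that the observed failure frequency stays below $\delta$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy model: each h in a finite H has an independent Bernoulli(p_h) loss,
# so L(h) = p_h and L_hat_S(h) is an empirical mean over m draws.
H_size, m, delta, trials = 50, 200, 0.05, 5_000
eps = np.sqrt(np.log(H_size / delta) / (2 * m))   # epsilon from the last step
p = rng.uniform(0.1, 0.4, size=H_size)            # one true risk per hypothesis

fails = 0
for _ in range(trials):
    emp = (rng.random((H_size, m)) < p[:, None]).mean(axis=1)  # L_hat_S(h) per h
    fails += np.any(p - emp >= eps)               # some h deviates by more than eps

print(f"epsilon = {eps:.3f}")
print(f"observed failure frequency = {fails / trials:.4f} (should be <= delta = {delta})")
```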