CS7545_Sp23_Lecture_07: Statistical Learning and ERM - mltheory/CS7545 GitHub Wiki
CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Tyler Labonte
Notes for Lecture 07
January 31, 2023
Scribes: Hassan, Bar Alluf
Statistical Learning Theory (Continued)
Let's revisit a few definitions:
$\mathcal{X}$ is an input space and $D$ is a distribution on $\mathcal{X} \times \{-1,+1\}$
$H$ is our hypothesis class, a set of functions $h:\mathcal{X} \rightarrow \{-1,+1\}$
$L(h)$ is our generalization error/risk. It can be represented as the probability that we misclassify a point. That is, $L(h) = P_{(x,y) \sim D}[h(x) \neq y]$
Our empirical error of a hypothesis $h$ on a training set $S = \{(x_{i},y_{i})\}_{i=1}^{m} \sim D^m$ is given by
$$\hat{L}_s(h)= \frac{1}{m} \sum _{i=1}^m \mathbf{1} (h(x_i)\neq y_i)$$
where $m$ is the number of data points
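As a sanity check, the empirical error can be computed directly from a sample. A minimal sketch, where the threshold classifier and the tiny data set are hypothetical illustrations, not from the lecture:

```python
import numpy as np

def empirical_risk(h, X, y):
    """Empirical error L_hat_S(h): the fraction of sample points h misclassifies."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(preds != y))

# Hypothetical 1-D threshold classifier and a sample of m = 4 points.
h = lambda x: 1 if x >= 0.5 else -1
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([-1, -1, 1, -1])   # h misclassifies the last point
err = empirical_risk(h, X, y)   # 1 mistake out of 4 -> 0.25
```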
Given this, we want our learned hypothesis to approach the Bayes error:
$$L_{\text{bayes}} = L(h_{\text{bayes}}) = \min_h L(h)$$
where the minimum is taken over all (measurable) hypotheses, not just those in $H$.
If $h^{*}$ is our best-in-class hypothesis, it minimizes the generalization error over $H$:
$$L(h^{*}) =\min_ {h\in H} L(h)$$
Error Decomposition
We can decompose our errors as:
$$L(h)- L_{\text{bayes}} =(L(h) - L(h^{*}))+(L(h^{*})- L_{\text{bayes}})$$
where the term $(L(h) - L(h^{*}))$ is our estimation error and $(L(h^{*})- L_{\text{bayes}})$ is our approximation error. For more details on this, visit the No Free Lunch Theorem section in Lecture 6.
Empirical Risk Minimization (ERM)
ERM is employed to select a hypothesis: we choose an $\hat{h}$ that lies among the minimizers of the empirical risk over $H$.
$$\hat{h} \in \operatorname{argmin}_{h\in H} \hat{L}_s(h)$$
However, with possibly infinitely many hypotheses in $H$, this problem might be NP-hard to compute. We can limit the cardinality of the search space $H$, but this introduces bias.
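To make ERM concrete, here is a sketch over a small finite class of 1-D threshold classifiers; the class and data are illustrative assumptions, not from the lecture:

```python
import numpy as np

def erm(H, X, y):
    """Return a hypothesis in H minimizing the empirical risk on the sample (X, y)."""
    def emp_risk(h):
        return np.mean(np.array([h(x) for x in X]) != y)
    return min(H, key=emp_risk)

# Finite class: 1-D threshold classifiers h_t(x) = sign(x - t), 11 choices of t.
thresholds = np.linspace(0.0, 1.0, 11)
H = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

X = np.array([0.05, 0.2, 0.35, 0.6, 0.8, 0.95])
y = np.array([-1, -1, -1, 1, 1, 1])
h_hat = erm(H, X, y)            # achieves zero empirical error on this sample
```

Because $H$ is finite, the argmin is an exhaustive search; for rich classes this search is exactly the computational bottleneck mentioned above.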
Note on Inductive Bias
We normally restrict the search space of the ERM learner by searching over a finite hypothesis class $H$. By doing so, we introduce a source of bias; such restrictions are often called inductive bias. However, this also helps tackle overfitting. To quote from the textbook:
"Intuitively, choosing a more restricted hypothesis class better protects us
against overfitting but at the same time might cause us a stronger inductive
bias." (Chapter 2.3, page 36)
Proposition 7.1
Prop: Let $\hat h$ be an ERM hypothesis on $S$. Then,
$$L(\hat{h})- L(h^{*}) \leq 2 \sup_{h\in H}|L(h)-\hat{L}_s(h)|$$
Our $L(\hat{h})$ is the generalization error of the Empirical Risk Minimizer over the training set $S$, whereas $L(h^{*})$ is the generalization error of our best-in-class hypothesis. This proposition states that the difference between the two of them is bounded by a property of the class.
Proof sketch: write
$$L(\hat{h}) - L(h^{*}) = \big(L(\hat{h}) - \hat{L}_s(\hat{h})\big) + \big(\hat{L}_s(\hat{h}) - \hat{L}_s(h^{*})\big) + \big(\hat{L}_s(h^{*}) - L(h^{*})\big)$$
The middle term is at most $0$ because $\hat{h}$ minimizes the empirical risk over $H$. Since both $\hat{h}$ and $h^{*}$ lie in $H$, each of the two remaining terms is at most $\sup_{h\in H} |L(h) - \hat{L}_s(h)|$, so
$$L(\hat{h}) - L(h^{*}) \leq 2 \sup_{h\in H} |L(h) - \hat{L}_s(h)|$$
The RHS is small if the empirical errors $\hat L_s(h)$ are close to the generalization errors $L(h)$ for all $h\in H$.
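Proposition 7.1 holds deterministically for every sample, which we can verify numerically. A sketch with a hypothetical finite input space and threshold class, where the true risks can be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X = {0,...,9} uniform, labels sign(x - 4.5),
# H = integer threshold classifiers (vectorized over numpy arrays).
xs = np.arange(10)
labels = np.where(xs >= 5, 1, -1)
H = [lambda x, t=t: np.where(x >= t, 1, -1) for t in range(11)]

def true_risk(h):
    return np.mean(h(xs) != labels)        # exact L(h) under uniform D

m = 30
idx = rng.integers(0, 10, size=m)          # sample S ~ D^m
Sx, Sy = xs[idx], labels[idx]
def emp_risk(h):
    return np.mean(h(Sx) != Sy)            # L_hat_S(h)

h_hat = min(H, key=emp_risk)               # ERM hypothesis
h_star = min(H, key=true_risk)             # best-in-class hypothesis
lhs = true_risk(h_hat) - true_risk(h_star)
rhs = 2 * max(abs(true_risk(h) - emp_risk(h)) for h in H)
# Proposition 7.1: lhs <= rhs for any sample S
```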
Proposition 7.2
Before moving on to the next proposition, notice how the following two statements differ:
There exists an $m$ such that for every $h\in H$, with probability $1-\delta$, $|L(h) - \hat{L}_s(h)| \le \varepsilon$
There exists an $m$ such that with probability $1-\delta$, for every $h\in H$, $|L(h) - \hat{L}_s(h)| \le \varepsilon$
Note: $|S| = m$ is the size of the data set.
The first statement follows from the central limit theorem. It captures a similar idea to "when the dataset is large enough, the empirical mean and the true mean are close", for one fixed $h$.
The second statement is more sophisticated: the required $m$ depends on the "complexity" of $H$, since the guarantee must hold simultaneously for every $h \in H$. This is a particular case of uniform convergence, sometimes called a uniform deviations bound. The first statement does in fact follow from the second, but proving the first directly yields a smaller bound on $m$.
A uniform convergence generalization bound is a one-sided bound on $L(h) - \hat L_s(h)$ that holds uniformly over $H$.
Prop: Let $H$ be a finite hypothesis class. Then for any $\delta> 0$, with probability greater than $1-\delta$ over a sample $S \sim D^m$, the following holds for all $h\in H$:
$$L(h) \leq \hat{L}_s(h) + \sqrt{\frac{\ln(|H|) + \ln(1/\delta)}{2m}}$$
Note: one of the key terms in this bound is $\ln(|H|) / m$, which tells us the bound grows with the size of the class $|H|$ and shrinks as we get more data.
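Using the standard one-sided deviation term $\sqrt{(\ln|H| + \ln(1/\delta))/(2m)}$ for a finite class, the scaling is easy to tabulate; a small sketch where the parameter values are arbitrary:

```python
import math

def finite_class_bound(H_size, m, delta):
    """One-sided deviation term sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

# Shrinks as m grows; grows only logarithmically in |H|.
b_small_m = finite_class_bound(H_size=1000, m=100, delta=0.05)
b_large_m = finite_class_bound(H_size=1000, m=10000, delta=0.05)
b_big_H   = finite_class_bound(H_size=10**6, m=100, delta=0.05)
```

Note how going from $|H| = 10^3$ to $10^6$ barely moves the bound, while going from $m = 100$ to $10^4$ shrinks it by a factor of $10$.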
Proof using Hoeffding
Hoeffding's inequality: if $z_1, \dots, z_m$ are independent random variables with each $z_i \in [0,1]$, then for any $\epsilon' > 0$,
$$P\left[\left|\sum_{i=1}^m z_i - \mathbb{E}\left[\sum_{i=1}^m z_i\right]\right| \geq \epsilon'\right] \leq 2e^{-2\epsilon'^2 / m}$$
In our case, $z_i = \mathbf{1} (h(x_i) \ne y_i)$, which is bounded in $[0,1]$ for all $i$.
We want to quantify the failure probability. To connect this result to the empirical expectation, we divide the event by $m$ and define a new $\epsilon = \frac{\epsilon'}{m}$, which leads to
$$P\left[\left|\hat{L}_s(h) - L(h)\right| \geq \epsilon\right] \leq 2e^{-2m\epsilon^2}$$
Notice how the $m$ moves from the denominator to the numerator of the exponent when we divide by $m$. This is natural: deviations can grow when we sum up terms, but shrink when we average them.
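This concentration can be checked by Monte Carlo: treat each indicator $z_i$ as a Bernoulli variable with success probability $p = L(h)$. A sketch where $p$, $m$, and $\epsilon$ are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each z_i = 1(h(x_i) != y_i) is Bernoulli(p) with p = L(h); p is hypothetical.
p, m, eps, trials = 0.3, 200, 0.05, 20000
z = rng.random((trials, m)) < p                 # trials independent samples of size m
emp = z.mean(axis=1)                            # empirical risks L_hat_S(h)
fail_prob = np.mean(np.abs(emp - p) >= eps)     # observed deviation frequency
hoeffding_bound = 2 * np.exp(-2 * m * eps**2)   # 2 e^{-2 m eps^2}
```

The observed failure frequency sits well below the Hoeffding bound, which is loose but distribution-free.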
To bound multiple failure events at once, we can use the union bound. This is a simple statement: for any events $A_1, \dots, A_k$,
$$P\left[\bigcup_{i=1}^{k} A_i\right] \leq \sum_{i=1}^{k} P[A_i]$$
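The union bound holds even for highly correlated events, which is why it pairs well with Hoeffding applied to each $h \in H$ separately. A small empirical sketch with hypothetical nested events driven by a single uniform draw:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical correlated events A_i = {u < c_i}, all driven by the same u.
cutoffs = np.array([0.10, 0.20, 0.15])
u = rng.random(100_000)
events = u[:, None] < cutoffs[None, :]    # shape (n_samples, k)
p_union = np.mean(events.any(axis=1))     # estimate of P[A_1 u ... u A_k]
sum_p = events.mean(axis=0).sum()         # sum_i P[A_i]
# Union bound: p_union <= sum_p (here p_union ~ 0.2 vs sum_p ~ 0.45)
```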