CS7545_Sp24_Lecture_07: Generalization Error Bounds
Define a data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y} = \{0, 1\}$ is the label space.
We defined a hypothesis on this distribution as a function $h: \mathcal{X} \to \mathcal{Y}$.
We then defined the generalization error of a hypothesis $h$ as $$L(h) = \Pr_{(x,y)\sim\mathcal{D}}\left[h(x) \neq y\right].$$
We worked out that it was impossible for the minimum generalization error to always be zero, as there exist stochastic distributions in which the label is determined by a probabilistic outcome.
To characterize the uncertainty of the distribution, we then defined the label noise function: $$\eta(x) = \Pr_{(x,y)\sim\mathcal{D}}\left[y = 1 \mid x\right].$$
Finally, using the label noise function, we defined the ideal classifier, the Bayes Classifier: $$h_{\text{Bayes}}(x) = \mathbb{1}\left[\eta(x) \geq \tfrac{1}{2}\right].$$
The Bayes Classifier is deemed to be the ideal classifier as it's the classifier with minimum generalization error.
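To make the recap concrete, here is a minimal Python sketch (my own toy example, not from the lecture): we hand-pick a small discrete distribution by specifying the marginal $p(x)$ and the label noise $\eta(x)$, build the Bayes Classifier as $\mathbb{1}[\eta(x) \geq 1/2]$, and compute generalization errors exactly. All numbers are hypothetical.

```python
import numpy as np

# Hypothetical toy distribution over X = {0, 1, 2, 3}; all numbers made up.
p_x = np.array([0.4, 0.3, 0.2, 0.1])    # marginal P(x)
eta = np.array([0.9, 0.2, 0.5, 0.05])   # label noise eta(x) = P(y = 1 | x)

# Bayes Classifier: predict 1 exactly when eta(x) >= 1/2.
h_bayes = (eta >= 0.5).astype(int)

def gen_error(h):
    """L(h) = P_{(x,y)~D}[h(x) != y] for a deterministic h over this finite X."""
    per_x = np.where(h == 1, 1.0 - eta, eta)   # P(h(x) != y | x)
    return float(np.sum(p_x * per_x))

print("Bayes Classifier h_Bayes:", h_bayes)                            # [1 0 1 0]
print("Bayes error L(h_Bayes)  :", gen_error(h_bayes))                 # 0.205
print("Error of the all-ones h :", gen_error(np.ones(4, dtype=int)))   # 0.475, worse
```

Note that even the Bayes Classifier has nonzero error here, because the labels are stochastic.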
While it may seem that the existence of this ideal classifier solves machine learning, there is one major issue with it:
- To learn the Bayes Classifier, we need knowledge of the entirety of $\mathcal{D}$ to construct $\eta$.
- However, if we knew the entire distribution $\mathcal{D}$, we wouldn't need to learn a classifier at all: we could simply read the optimal classifier off of $\mathcal{D}$ directly.
- The entire challenge of building a classifier is that we don't have $\mathcal{D}$; we only have a sample. Thus, building the Bayes Classifier is impossible!
The challenge then becomes: can we build a "good enough" classifier, one that's close enough to the Bayes Classifier, with just a sample drawn from $\mathcal{D}$?
Definition: Training Set
A training set is a set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of $m$ examples drawn i.i.d. from $\mathcal{D}$.
Note: We write $S \sim \mathcal{D}^m$ to denote drawing such an i.i.d. sample of size $m$ from $\mathcal{D}$.
Definition: Empirical Error
The empirical error of a hypothesis $h$ on a training set $S$ is defined as $$\widehat{L}_S(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\left[h(x_i) \neq y_i\right].$$
Note: We can alternatively define the generalization error as such: $$L(h) = \mathbb{E}_{S\sim\mathcal{D}^m}\left[\widehat{L}_S(h)\right].$$
In simpler terms, this states that the generalization error is the expectation of the empirical error over all possible samples drawn from the distribution. This should make intuitive sense: since all samples are drawn i.i.d. from the distribution, the true generalization error is roughly the average of the empirical errors across samples.
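The following sketch (again my own illustration, reusing the hypothetical toy distribution above) checks the alternative definition numerically: the empirical error of a fixed hypothesis fluctuates from sample to sample, but averaging $\widehat{L}_S(h)$ over many independent samples approaches $L(h)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical toy distribution as in the sketch above.
p_x = np.array([0.4, 0.3, 0.2, 0.1])
eta = np.array([0.9, 0.2, 0.5, 0.05])

def draw_sample(m):
    """Draw S = ((x_1, y_1), ..., (x_m, y_m)) i.i.d. from D."""
    xs = rng.choice(len(p_x), size=m, p=p_x)
    ys = (rng.random(m) < eta[xs]).astype(int)
    return xs, ys

def empirical_error(h, xs, ys):
    """hat{L}_S(h): fraction of the sample that h misclassifies."""
    return float(np.mean(h[xs] != ys))

def gen_error(h):
    """L(h), computed exactly from p_x and eta."""
    return float(np.sum(p_x * np.where(h == 1, 1.0 - eta, eta)))

h = np.array([1, 0, 1, 0])   # a fixed hypothesis (here, the Bayes Classifier)
emp = [empirical_error(h, *draw_sample(m=50)) for _ in range(2000)]

print("L(h)                          =", gen_error(h))          # 0.205
print("average of hat{L}_S(h) over S =", float(np.mean(emp)))   # close to 0.205
```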
Definition: Excess Error
The excess error of a hypothesis $h$ is the gap between its generalization error and that of the Bayes Classifier: $$L(h) - L(h_{\text{Bayes}}).$$
Our goal is to bound this excess error term.
- Unfortunately, machine learning theorists have proven that this is generally impossible.
Theorem: No Free Lunch Theorem
For any algorithm $A$ that maps a sample $S$ to a hypothesis $A(S)$, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ such that the expected generalization error $$\mathbb{E}_{S\sim\mathcal{D}^m}\left[L(A(S))\right]$$ is roughly no better than that of a coin flip (i.e., close to $\tfrac{1}{2}$).
There are a couple of important observations we can make about this theorem:
- The term inside the expectation looks very similar to the term inside the expectation in our alternative definition of generalization error in terms of empirical error.
- In fact, it's actually a stronger result! Instead of fixing the hypothesis, as in that definition, we instead fix the algorithm that determines the hypothesis itself. This means the statement holds for any hypothesis that the algorithm can produce.
- The theorem effectively states that regardless of what hypothesis we can build, there exists a distribution such that the performance over the distribution is roughly no better than that of a coin flip.
The major consequence of this theorem is that it's useless to attempt to bound the excess error over all possible distributions and hypotheses.
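To see informally why such a bad distribution must exist, here is a small simulation of the averaging idea behind the proof (a sketch under assumptions I chose: a tiny domain, a uniform marginal, deterministic labels, and one particular "memorize and predict 0" learner; this is not the theorem statement from class). Averaged over every possible labeling of the domain, this learner's error on points it never saw is exactly $1/2$, so at least one labeling must be at least that bad for it.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

N, m = 8, 4                     # tiny domain {0, ..., 7}, sample size 4 (both made up)
domain = np.arange(N)

def learner(xs, ys):
    """A fixed learner: memorize the sample, predict 0 on every unseen point."""
    h = np.zeros(N, dtype=int)
    h[xs] = ys
    return h

unseen_error_per_f = []
for f in product([0, 1], repeat=N):          # every possible labeling of the domain
    f = np.array(f)
    errs = []
    for _ in range(50):                      # estimate the expectation over S ~ D^m
        xs = rng.integers(0, N, size=m)      # uniform marginal over the domain
        h = learner(xs, f[xs])               # labels are y = f(x)
        unseen = np.setdiff1d(domain, xs)    # points the learner never saw
        errs.append(np.mean(h[unseen] != f[unseen]))
    unseen_error_per_f.append(np.mean(errs))

unseen_error_per_f = np.array(unseen_error_per_f)
print("error on unseen points, averaged over all labelings:", unseen_error_per_f.mean())  # ~0.5
print("error on unseen points for the worst labeling      :", unseen_error_per_f.max())   # up to 1.0
```

The averaging shows a bad labeling exists for this particular learner; the theorem says the same happens to every learner.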
How can we reduce the scope of our analysis? One major solution is to restrict our set of searchable hypotheses to a specific hypothesis class.
Definition: Hypothesis Class
A hypothesis class $\mathcal{H}$ is a set of hypotheses $h: \mathcal{X} \to \mathcal{Y}$ from which we choose our classifier.
We normally restrict our hypothesis class to one that best represents the model we are trying to fit.
Some examples of hypothesis classes are:
- Polynomials with degree less than $n$
- 2-Layer Neural Networks
- Linear Classifiers
The goal then becomes to compare our given model, which belongs to a class, with the best possible model in that hypothesis class.
Definition: Error Decomposition
Let $h \in \mathcal{H}$ be our hypothesis, let $h^\star \in \arg\min_{h'\in\mathcal{H}} L(h')$ be the best hypothesis in the class, and let $h_{\text{Bayes}}$ be the Bayes Classifier. Then the excess error decomposes as $$L(h) - L(h_{\text{Bayes}}) = \underbrace{\left(L(h) - L(h^\star)\right)}_{\text{estimation error}} + \underbrace{\left(L(h^\star) - L(h_{\text{Bayes}})\right)}_{\text{approximation error}}.$$
The first term is known as the estimation error, and it measures how good our hypothesis is relative to the hypothesis class it's in. The second term is known as the approximation error, which measures how good our hypothesis class is, relative to the optimal hypothesis.
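As a quick worked example with made-up numbers: if $L(h) = 0.30$, $L(h^\star) = 0.22$, and $L(h_{\text{Bayes}}) = 0.15$, then $$L(h) - L(h_{\text{Bayes}}) = \underbrace{(0.30 - 0.22)}_{\text{estimation error}} + \underbrace{(0.22 - 0.15)}_{\text{approximation error}} = 0.08 + 0.07 = 0.15.$$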
Note: By the No Free Lunch Theorem, we have no way to directly calculate the approximation error. However:
- We can still approximate this value somewhat well
- If the hypothesis class selected for the problem models it well, the approximation error is small and can be ignored in the analysis.
Definition: Bias-Complexity Tradeoff
Suppose we plot a graph of the upper bound on the error compared to the complexity of our hypothesis class (defined later). For now, we will define the complexity of our hypothesis class as the size of the class.
- As we plot the estimation error versus complexity, the estimation error will in general tend to increase. This is because as our hypothesis class gets larger, the best-in-class hypothesis keeps improving, so the upper bound on the gap between that optimal in-class hypothesis and a given fixed hypothesis grows.
- Similarly, as we plot the approximation error versus complexity, the approximation error will in general tend to decrease. This is because as our hypothesis class gets larger, it becomes increasingly likely that the class contains a model whose performance is comparable to that of the Bayes Classifier.
- The excess error is the sum of the estimation error and the approximation error. If this is graphed, you'll notice that there exists a minimum value of this error; the complexity at which it is attained corresponds to the optimal hypothesis class (see the numerical sketch after this list).
- Any hypothesis class with complexity greater than this optimum tends to produce models that are overfit.
- Any hypothesis class with complexity less than this optimum tends to produce models that are underfit.
- Both of these will have excess errors greater than that of the optimum.
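The tradeoff is easy to see numerically. The sketch below is a hypothetical regression analogue of my own choosing (squared loss and polynomial classes $\mathcal{H}_d$ of degree at most $d$, rather than the 0-1 loss used above): training error keeps falling as the degree grows, while test error, standing in for generalization error, typically follows the U-shape described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression analogue: H_d = polynomials of degree <= d, squared loss.
def target(x):
    return np.sin(2 * np.pi * x)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    y = target(x) + 0.3 * rng.normal(size=n)   # label noise plays the role of Bayes error
    return x, y

x_train, y_train = make_data(12)               # small training set
x_test,  y_test  = make_data(5000)             # large test set stands in for D

for d in [0, 1, 3, 5, 9]:                      # "complexity" = polynomial degree d
    coeffs = np.polyfit(x_train, y_train, deg=d)             # ERM within H_d (least squares)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_test)  - y_test)  ** 2)
    print(f"degree {d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Typical output: train MSE keeps shrinking as d grows (approximation error falls),
# while test MSE first drops and then rises again (estimation error takes over).
```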
Discussion: Is size the best measure of complexity of a given hypothesis class?
- It's good enough when discussing finite sized hypothesis classes.
- However, if we expand to infinite hypothesis classes (especially uncountable ones), comparing two uncountable sets by size causes this definition to fall apart.
Definition: Empirical Risk Minimizing Hypothesis
The empirical risk minimizing (ERM) hypothesis $\hat{h}_S$ for a sample $S$ is any hypothesis in $\mathcal{H}$ that minimizes the empirical error: $$\hat{h}_S \in \arg\min_{h\in\mathcal{H}} \widehat{L}_S(h).$$
This hypothesis is the one whose estimation error we bound in the proposition below.
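Here is a minimal sketch of ERM over a small finite class (a toy setup I made up: threshold classifiers $h_t(x) = \mathbb{1}[x \geq t]$ with $t \in \{0.0, 0.1, \ldots, 1.0\}$, and data with 10% label noise). The ERM hypothesis is simply the threshold with the smallest empirical error on $S$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x ~ Uniform[0, 1]; true label 1[x >= 0.35], flipped with prob 0.1.
def draw_sample(m):
    x = rng.uniform(0, 1, size=m)
    flip = rng.random(m) < 0.1
    y = ((x >= 0.35).astype(int) + flip.astype(int)) % 2   # apply the label noise
    return x, y

thresholds = np.linspace(0.0, 1.0, 11)   # the finite class H = {h_t : h_t(x) = 1[x >= t]}

def empirical_error(t, x, y):
    return float(np.mean((x >= t).astype(int) != y))

x_S, y_S = draw_sample(100)
errors = [empirical_error(t, x_S, y_S) for t in thresholds]
t_hat = thresholds[int(np.argmin(errors))]                  # the ERM hypothesis hat{h}_S

print("empirical errors over H:", np.round(errors, 2))
print("ERM threshold t_hat    :", t_hat, " with hat{L}_S(hat{h}_S) =", min(errors))
```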
Proposition (Bound on the ERM Loss): Let $\hat{h}_S \in \mathcal{H}$ be an ERM hypothesis for a sample $S$, and let $h^\star \in \arg\min_{h\in\mathcal{H}} L(h)$. Then $$L(\hat{h}_S) - L(h^\star) \leq 2\sup_{h\in\mathcal{H}} \left|L(h) - \widehat{L}_S(h)\right|.$$
Remarks:
- There exists an upper bound on the estimation error.
- This upper bound depends on the sample drawn (due to the second term in the supremum).
- While the left-hand side considers the specific hypothesis, the right-hand side does not; it instead bounds the performance with respect to the hypothesis class itself.
- This implies that the bound is independent of the algorithm or selected hypothesis; it depends only on the hypothesis class.
- This bound compares the difference in generalization error with a difference between generalization error and empirical error.
- The upper bound tightens as the sample becomes more representative of the distribution.
- As the drawn sample gets larger, the empirical errors slowly converge towards their expectation over samples $S \sim \mathcal{D}^m$, which (by the alternative definition above) is exactly the generalization error.
- This vaguely looks like a concentration bound, but that is a discussion for later.
Proof of the Proposition:
Begin with the left-hand side. Since $\hat{h}_S$ minimizes the empirical error over $\mathcal{H}$, we have $\widehat{L}_S(\hat{h}_S) \leq \widehat{L}_S(h^\star)$, so $$L(\hat{h}_S) - L(h^\star) = L(\hat{h}_S) - \widehat{L}_S(\hat{h}_S) + \widehat{L}_S(\hat{h}_S) - L(h^\star) \leq L(\hat{h}_S) - \widehat{L}_S(\hat{h}_S) + \widehat{L}_S(h^\star) - L(h^\star).$$
At this point we can group the terms based on their hypotheses; the right-hand side above is at most $$\left|L(\hat{h}_S) - \widehat{L}_S(\hat{h}_S)\right| + \left|\widehat{L}_S(h^\star) - L(h^\star)\right|.$$
We can then claim that each of these two absolute-value terms is at most the deviation in error incurred by the worst possible classifier in the hypothesis class. More specifically, each of the two terms above is at most $$\sup_{h\in\mathcal{H}} \left|L(h) - \widehat{L}_S(h)\right|.$$
As each of the two deviation terms is at most this supremum, their sum, and hence the left-hand side, is at most $$2\sup_{h\in\mathcal{H}} \left|L(h) - \widehat{L}_S(h)\right|.$$
This completes the proof.
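Continuing the hypothetical threshold example from the ERM sketch above, we can check the proposition numerically: for every drawn sample, the estimation error $L(\hat{h}_S) - L(h^\star)$ stays below $2\sup_{h\in\mathcal{H}}\left|L(h) - \widehat{L}_S(h)\right|$, and the bound shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical setup as the ERM sketch: x ~ Uniform[0,1], true label 1[x >= 0.35]
# flipped with probability 0.1, and H = thresholds {0.0, 0.1, ..., 1.0}.
thresholds = np.linspace(0.0, 1.0, 11)

def gen_error(t):
    """Exact L(h_t) for this distribution: 0.1 base noise plus 0.8 times the mass between t and 0.35."""
    return 0.1 + 0.8 * abs(t - 0.35)

def draw_sample(m):
    x = rng.uniform(0, 1, size=m)
    y = ((x >= 0.35).astype(int) + (rng.random(m) < 0.1).astype(int)) % 2
    return x, y

L = np.array([gen_error(t) for t in thresholds])
t_star = thresholds[int(np.argmin(L))]           # best-in-class hypothesis h*

for m in [20, 100, 1000]:
    gaps, bounds = [], []
    for _ in range(500):
        x, y = draw_sample(m)
        L_hat = np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])
        t_hat = thresholds[int(np.argmin(L_hat))]                 # ERM hypothesis
        gaps.append(gen_error(t_hat) - gen_error(t_star))         # estimation error
        bounds.append(2 * float(np.max(np.abs(L - L_hat))))       # 2 * sup_h |L(h) - hat{L}_S(h)|
    held = all(g <= b + 1e-12 for g, b in zip(gaps, bounds))
    print(f"m = {m:4d}: mean estimation error = {np.mean(gaps):.3f}, "
          f"mean bound = {np.mean(bounds):.3f}, bound held on every sample: {held}")
```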
- Graded. Is it possible for an ERM hypothesis $\hat{h}_S \in \mathcal{H}$ on a set $S \subseteq \mathcal{X}$ to have $\widehat{L}_S(\hat{h}_S) = 0$ but $L(\hat{h}_S) = 1$? Why or why not? Would this be overfitting or underfitting? What is the role of the complexity of $\mathcal{H}$ in this situation?
- Graded. Let $\mathcal{H}$ be a hypothesis class, $\hat{h}_S \in \mathcal{H}$ be an ERM hypothesis for a sample $S$, and $h^\star \in \arg\min_{h\in\mathcal{H}} L(h)$. Show that $$\mathbb{E}_{S\sim\mathcal{D}^m}\left[\widehat{L}_S(\hat{h}_S)\right] \leq L(h^\star) \leq \mathbb{E}_{S\sim\mathcal{D}^m}\left[L(\hat{h}_S)\right].$$
- Ungraded, optional. Here are some resources if you would like to study the proof of the no-free-lunch theorem. Most sources prove a simpler version; the one from class takes a bit more work. I highly recommend this illustrated proof. For a textbook version, see Section 5.1, "The No-Free-Lunch Theorem," in Understanding Machine Learning: From Theory to Algorithms. (Don't worry about the PAC-learning stuff in Corollary 5.2.)