CS7545_Sp24_Lecture_07: Generalization Error Bounds

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Tyler LaBonte

Notes for Lecture 07

January 30, 2024

Scribes: Adithya Vasudev, Abinav Chari

Review: The Bayes Classifier

Define $\mathcal{X}$ to be an input space, and $D$ to be a distribution on the Cartesian product $\mathcal{X} \times \lbrace -1, 1 \rbrace$.

We defined a hypothesis on this distribution as a function $h: \mathcal{X} \rightarrow \lbrace -1, 1 \rbrace$.

We then defined the generalization error of a hypothesis $h$ to be $L(h) = P_{(x, y) \sim D} (h(x) \neq y)$. This is a value in $[0, 1]$.

We observed that the minimum generalization error is not always zero: for stochastic distributions, the label $y$ is not a deterministic function of $x$, so even the best possible classifier must sometimes be wrong.

To characterize the uncertainty of the distribution, we then defined the label noise function: $\eta(x) = P(y = 1 \mid x)$

Finally, using the label noise function, we defined the ideal classifier: the Bayes Classifier:

$$ h_{\text{Bayes}}(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2}\\ -1 & \text{if } \eta(x) < \frac{1}{2} \end{cases} $$

The Bayes Classifier is deemed the ideal classifier because it achieves the minimum possible generalization error over all hypotheses.
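To make the review concrete, here is a minimal Python sketch (not from the lecture; the marginal $p(x)$ and label noise $\eta(x)$ values below are made-up assumptions) that writes down the Bayes Classifier on a three-point input space and computes its generalization error.

```python
# Toy example (illustrative numbers only): the Bayes classifier on a three-point input space.
# p[x] is the marginal probability of x; eta[x] = P(y = 1 | x) is the label noise function.
p   = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.5, "c": 0.2}

def h_bayes(x):
    """Predict +1 exactly when eta(x) >= 1/2, as in the definition above."""
    return 1 if eta[x] >= 0.5 else -1

def gen_error(h):
    """L(h) = P(h(x) != y) = sum_x p(x) * P(h(x) != y | x)."""
    return sum(p[x] * ((1 - eta[x]) if h(x) == 1 else eta[x]) for x in p)

print(gen_error(h_bayes))  # 0.5*0.1 + 0.3*0.5 + 0.2*0.2 = 0.24
```

On this toy distribution no hypothesis can do better than $0.24$, since each point $x$ contributes at least $p(x)\min(\eta(x), 1-\eta(x))$ to the error.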

Lecture 7: Generalization Error Bounds (continued)

The Problem with Bayes' Classifier

While it may seem that we have solved machine learning, with the existence of the ideal classifier, there is one major issue with it:

  • To construct the Bayes Classifier, we need knowledge of the entire distribution $D$ in order to compute $\eta$.
  • However, if we already knew all of $D$, there would be nothing left to learn: we could read off $\eta$ and write down the Bayes Classifier directly.
  • The entire challenge of building a classifier is that we don't have $D$; we only have a sample. Thus, constructing the Bayes Classifier directly is impossible!

The challenge then becomes: can we build a "good enough" classifier, one that is close to the Bayes Classifier, using only a sample drawn from $D$?

No Free Lunch Theorem

Definition: Training Set

A training set is a set $S = \lbrace (x_i, y_i) \rbrace_{i = 1}^{m}$ where each pair $(x_i, y_i)$ is drawn i.i.d. from $D$; equivalently, $S \sim D^m$.

Note: $D^m$ denotes the distribution of $m$ i.i.d. samples drawn from $D$.

Definition: Empirical Error

The empirical error of a hypothesis $h$ over a training set $S$ is defined as:

$$\hat{L}_S(h) = \frac 1 m \sum _{i=1}^m \mathbb{1} (h(x_i)\neq y_i)$$

Note: We can alternatively define the generalization error as such:

$$L(h) = \mathbb{E}_{S \sim D^m}(\hat{L}_S(h))$$

In simpler terms, this states that the generalization error of a fixed hypothesis is the expectation of its empirical error over all possible training sets drawn from the distribution. This should make intuitive sense: since the samples are drawn i.i.d. from $D$, each indicator $\mathbb{1}(h(x_i) \neq y_i)$ has expectation exactly $L(h)$, so the empirical error is an unbiased estimate of the generalization error.
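As a sanity check of this identity, here is a small Python sketch (an illustration under made-up assumptions, not lecture code): it draws many training sets $S \sim D^m$ from a toy distribution and verifies that the average empirical error of a fixed hypothesis approaches its generalization error.

```python
# Toy check that E_{S ~ D^m}[L_hat_S(h)] matches L(h) for a *fixed* hypothesis h.
# The distribution below (p and eta) is made up purely for illustration.
import random

random.seed(0)
p   = {"a": 0.5, "b": 0.3, "c": 0.2}           # marginal distribution of x
eta = {"a": 0.9, "b": 0.5, "c": 0.2}           # eta(x) = P(y = 1 | x)

def draw_sample(m):
    """Draw S = ((x_1, y_1), ..., (x_m, y_m)) with each pair i.i.d. from D."""
    xs = random.choices(list(p), weights=list(p.values()), k=m)
    return [(x, 1 if random.random() < eta[x] else -1) for x in xs]

def empirical_error(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

h = lambda x: 1                                 # a fixed hypothesis: always predict +1
L_h = sum(p[x] * (1 - eta[x]) for x in p)       # its exact generalization error: 0.36

avg = sum(empirical_error(h, draw_sample(50)) for _ in range(2000)) / 2000
print(L_h, avg)                                 # the average should be close to L(h)
```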

Definition: Excess Error

$$L_{\text{excess}}(h) = L(h) - L(h_{\text{Bayes}})$$

Our goal is to bound this excess error term.

  • Unfortunately, machine learning theorists have proven that this is generally impossible.

Theorem: No Free Lunch Theorem

For any algorithm $A$ that maps a sample set to a hypothesis, and for any sample size $m$ and any $\epsilon > 0$, there exists a distribution $D$ such that $L(h_\text{Bayes})=0$ and

$$\mathbb{E}_{S \sim D^m}[L(A(S))] \geq \frac{1}{2} - \epsilon$$

There are a couple of important observations we can make about this result:

  • The expression on the left looks very similar to our alternative formula for the generalization error, $L(h) = \mathbb{E}_{S \sim D^m}(\hat{L}_S(h))$, except that the hypothesis inside the expectation is no longer fixed: $A(S)$ depends on the sample.
    • In fact, this is a stronger kind of statement! Instead of fixing the hypothesis, as in that definition, we fix the algorithm that produces the hypothesis. This means the result holds for whatever hypothesis the algorithm outputs on each sample.
  • The theorem effectively states that no matter what algorithm we use, there exists a distribution on which the learned hypothesis performs, in expectation, roughly no better than a coin flip, even though the Bayes Classifier on that distribution is perfect.

The major consequence of this theorem is that it is hopeless to attempt to bound the excess error uniformly over all possible distributions and algorithms; we must restrict the problem somehow.

The Bias-Complexity Tradeoff

How can we reduce the scope of our analysis? One major solution is to restrict our set of searchable hypotheses to a specific hypothesis class.

Definition: Hypothesis Class

A hypothesis class $\mathcal{H}$ is a subset of the set of all hypotheses. More formally:

$$\mathcal{H} \subseteq \lbrace h: \mathcal{X} \rightarrow \lbrace -1, 1 \rbrace \rbrace$$

We normally restrict our hypothesis class to one that best represents the model we are trying to fit.

Some examples of hypothesis classes are:

  • Polynomials with degree less than $n$
  • 2-Layer Neural Networks
  • Linear Classifiers
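As a concrete illustration of the last example (not stated in the lecture, but standard), linear classifiers on $\mathbb{R}^d$ can be written formally as

$$\mathcal{H}_{\text{lin}} = \lbrace x \mapsto \text{sign}(w^\top x + b) : w \in \mathbb{R}^d,\ b \in \mathbb{R} \rbrace,$$

which is a strict subset of the set of all functions from $\mathbb{R}^d$ to $\lbrace -1, 1 \rbrace$.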

The goal then becomes to compare our given model, which belongs to a class, with the best possible model in that hypothesis class.

Definition: Error Decomposition

Let $h^* = \text{argmin}_{h \in \mathcal{H}} L(h)$ be the "best classifier" in a given hypothesis class $\mathcal{H}$. Then the excess error $L(h) - L(h_{\text{Bayes}})$ can be split into two terms as follows:

$$L(h) - L(h_{\text{Bayes}}) = (L(h) - L(h^*)) + (L(h^*) - L(h_{\text{Bayes}}))$$

The first term is known as the estimation error; it measures how good our hypothesis is relative to the best hypothesis in its class. The second term is known as the approximation error; it measures how good the hypothesis class itself is relative to the Bayes Classifier.
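As a hypothetical numeric illustration (the values are made up): if $L(h) = 0.30$, $L(h^*) = 0.22$, and $L(h_{\text{Bayes}}) = 0.10$, then the estimation error is $L(h) - L(h^*) = 0.08$, the approximation error is $L(h^*) - L(h_{\text{Bayes}}) = 0.12$, and they sum to the excess error $0.20$.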

Note: The approximation error depends on the unknown distribution $D$ (and, by the No Free Lunch Theorem, cannot be bounded uniformly over all distributions), so we have no way to calculate it directly. However:

  1. We can still estimate this value reasonably well in many settings.
  2. If the hypothesis class selected for the problem models the problem well, the approximation error is small, so our analysis can focus on the estimation error.

Definition: Bias-Complexity Tradeoff

Suppose we plot the upper bound on the error against the complexity of our hypothesis class (a notion we will define precisely later). For now, we will take the complexity of a hypothesis class to be its size.

  • The estimation error tends to increase with complexity. As the hypothesis class gets larger, the best hypothesis in the class only improves while it becomes harder to identify from a fixed sample, so the gap $L(h) - L(h^*)$ (and our upper bound on it) tends to grow.
  • Conversely, the approximation error tends to decrease with complexity. As the hypothesis class gets larger, it becomes increasingly likely that it contains a hypothesis whose performance is comparable to that of the Bayes Classifier.
  • The excess error is the sum of the estimation error and the approximation error. Plotting this sum, we see that there is a complexity at which it is minimized; the class of that complexity is the optimal hypothesis class.
    • Choosing a class more complex than this optimum tends to produce models that overfit.
    • Choosing a class less complex than this optimum tends to produce models that underfit.
    • Both cases incur a larger excess error than the optimum.

Discussion: Is size the best measure of complexity of a given hypothesis class?

  • It is good enough when discussing finite hypothesis classes.
  • However, once we allow infinite (especially uncountable) hypothesis classes, cardinality no longer distinguishes classes in a meaningful way, and this definition falls apart; we will need a better measure of complexity.

Empirical Risk Minimization

Definition: Empirical Risk Minimizing Hypothesis

The empirical risk minimizing hypothesis $\hat{h}$ for a hypothesis class $\mathcal{H}$, given the empirical error $\hat{L}_S$ over a training set $S \sim D^m$, is:

$$\hat{h} = \text{argmin}_{h \in \mathcal{H}} \hat{L}_S(h)$$

This hypothesis $\hat{h}$ is known as the ERM-classifier (and has unique properties that can be exploited).
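To make the definition concrete, here is a minimal Python sketch (an illustration under made-up assumptions, not lecture code) that computes the ERM hypothesis over a small finite class of one-dimensional threshold classifiers $h_t(x) = \text{sign}(x - t)$.

```python
# ERM over a finite class of threshold classifiers h_t(x) = sign(x - t).
# The data-generating process below is made up for illustration.
import random

random.seed(0)

def draw_sample(m):
    """True label is sign(x - 0.5); each label is flipped with probability 0.1."""
    S = []
    for _ in range(m):
        x = random.random()
        y = 1 if x >= 0.5 else -1
        if random.random() < 0.1:
            y = -y
        S.append((x, y))
    return S

thresholds = [i / 20 for i in range(21)]                 # H = {h_t : t in {0, 0.05, ..., 1}}
H = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

def empirical_error(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

S = draw_sample(100)
h_hat = min(H, key=lambda h: empirical_error(h, S))      # the ERM hypothesis over H
print(empirical_error(h_hat, S))                         # its (minimal) empirical error on S
```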

Proposition (Bound on the ERM-Loss): Let $\hat{h}$ be an ERM hypothesis on a given training set $S$. Then the estimation error of the ERM-classifier is bounded as:

$$L(\hat{h})- L(h^{*}) \leq 2 \sup_{h \in \mathcal{H}} |L(h) - \hat{L}_S(h)|$$

Remarks:

  • There exists an upper bound on the estimation error.
  • This upper bound depends on the sample drawn (due to the second term in the supremum).
  • While the left-hand side involves specific hypotheses, the right-hand side does not; instead, it bounds the deviation uniformly over the hypothesis class.
    • This implies that the bound is independent of the algorithm or the selected hypothesis; it depends only on the hypothesis class (and the sample).
  • This bound compares the difference in generalization error with a difference between generalization error and empirical error.
  • The upper bound is tightened as the sample approaches the distribution.
    • As the drawn sample gets larger, each empirical error $\hat{L}_S(h)$ concentrates around its expectation over samples, which is by definition the generalization error $L(h)$.
    • This vaguely looks like a concentration bound, but that is a discussion for later.

Proof of the Proposition: Begin with the left-hand side: $L(\hat{h})- L(h^{*})$. We then proceed with some simple algebraic manipulation: $$L(\hat{h})- L(h^{*}) \leq L(\hat{h}) - L(h^{*}) + \hat{L}_S(h^{*}) - \hat{L}_S(\hat{h})$$ This addition maintains the inequality because the term $\hat{L}_S(h^{*}) - \hat{L}_S(\hat{h})$ is nonnegative by the definition of the ERM hypothesis: $\hat{h}$ minimizes the empirical error over $\mathcal{H}$, so $\hat{L}_S(\hat{h}) \leq \hat{L}_S(h^{*})$.

We now group the terms by hypothesis and use the fact that $a \leq |a|$; the right-hand side above is therefore at most: $$|L(\hat{h}) - \hat{L}_S(\hat{h})| + |L(h^{*}) - \hat{L}_S(h^{*})| $$

Since both $\hat{h}$ and $h^{*}$ belong to $\mathcal{H}$, each of these two absolute-value terms is at most the largest such deviation over the class: $$\sup_{h \in \mathcal{H}} |L(h) - \hat{L}_S(h)| $$

As each of the two deviation terms is at most this supremum, their sum is at most: $$2 \sup_{h \in \mathcal{H}} |L(h) - \hat{L}_S(h)|$$

This completes the proof.
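To see the proposition in action, here is a small Monte Carlo sketch (an illustration under made-up assumptions, not lecture code): for ERM over a finite class of threshold classifiers on a known discrete distribution, the estimation error never exceeds twice the largest deviation between generalization and empirical error over the class.

```python
# Numerically checking L(h_hat) - L(h*) <= 2 * sup_{h in H} |L(h) - L_hat_S(h)|
# for ERM over a finite class. The distribution below is made up for illustration.
import random

random.seed(0)
points = [0.1, 0.3, 0.5, 0.7, 0.9]                           # finite input space, uniform p(x)
eta = {0.1: 0.1, 0.3: 0.2, 0.5: 0.5, 0.7: 0.8, 0.9: 0.95}    # eta(x) = P(y = 1 | x)

thresholds = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
H = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]  # finite hypothesis class

def gen_error(h):
    """Exact L(h) under the uniform marginal on `points`."""
    return sum((1 - eta[x]) if h(x) == 1 else eta[x] for x in points) / len(points)

def emp_error(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

L_star = min(gen_error(h) for h in H)                         # L(h*)

for _ in range(5):                                            # a few random training sets
    S = [(x, 1 if random.random() < eta[x] else -1)
         for x in random.choices(points, k=50)]
    h_hat = min(H, key=lambda h: emp_error(h, S))             # ERM hypothesis for this S
    lhs = gen_error(h_hat) - L_star
    rhs = 2 * max(abs(gen_error(h) - emp_error(h, S)) for h in H)
    print(f"{lhs:.3f} <= {rhs:.3f}")                          # the inequality always holds
```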

Homework

  1. Graded. Is it possible for an ERM hypothesis $\hat{h}_S\in \mathcal{H}$ on a set $S\subseteq \mathcal{X}$ to have $\widehat{L}_S(\hat{h}_S)=0$ but $L(\hat{h}_S)=1$? Why or why not? Would this be overfitting or underfitting? What is the role of the complexity of $\mathcal{H}$ in this situation?

  2. Graded. Let $\mathcal{H}$ be a hypothesis class, $\hat{h}_S\in \mathcal{H}$ be an ERM hypothesis for a sample $S$, and $h^\star \in \arg\min_{h\in\mathcal{H}} L(h)$. Show that $$\mathbb{E}_{S\sim\mathcal{D}^m}[\widehat{L}_S(\hat{h}_S)]\leq L(h^\star) \leq \mathbb{E}_{S\sim\mathcal{D}^m}[L(\hat{h}_S)].$$

  3. Ungraded, optional. Here are some resources if you would like to study the proof of the no-free-lunch theorem. Most sources prove a simpler version; the one from class takes a bit more work. I highly recommend this illustrated proof. For a textbook version, see Section 5.1 "The No-Free-Lunch Theorem" in Understanding Machine Learning: From Theory to Algorithms. (Don't worry about the PAC-learning stuff in Corollary 5.2).
