CS7545_Sp23_Lecture_06: Intro to Statistical Learning Theory - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler Labonte

Notes for Lecture 06

January 26, 2023

Scribes: Conner Yurkon, Jules Morata

Statistical Learning Theory

Definition:

Generalization is the ability of a model to perform well on unseen data drawn from the training distribution.

Goal:

We want to study the link between training error and test error.

Why should we understand generalization?

  • To guarantee model performance
  • To guide practical research
  • To better understand training dynamics
  • To better understand distribution shift
  • To guide data collection

Setup:

Let $\chi$ be an input space, D a distribution on $\chi \times \lbrace -1,1 \rbrace$, and $h : \chi \rightarrow \lbrace -1,1\rbrace$ a hypothesis.

Definition:

The generalization error/risk of h is $L(h) = Pr_{x,y\thicksim D}[h(x) \neq y]$.

We want to find $h^{*} = \underset{h}{\arg \min} L(h)$.

Definition:

Let $\eta(x) = Pr[y=1|x]$. The Bayes classifier is

$$ h_{\text{Bayes}}(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2}\\ -1 & \text{if } \eta(x) < \frac{1}{2} \end{cases} $$
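As a concrete sketch, the Bayes classifier is easy to write down once $\eta$ is known. Here is a minimal Python version, assuming a hypothetical $\eta(x)$ (a logistic function of a scalar input, chosen purely for illustration; in practice $\eta$ is unknown):

```python
import numpy as np

def eta(x):
    # Hypothetical conditional probability Pr[y = 1 | x]; a logistic
    # function of a scalar input, chosen purely for illustration.
    return 1.0 / (1.0 + np.exp(-x))

def h_bayes(x):
    # Predict 1 when eta(x) >= 1/2, else -1 (ties broken toward 1,
    # matching the definition above).
    return np.where(eta(x) >= 0.5, 1, -1)
```

For example, `h_bayes(2.0)` returns `1` since $\eta(2) > 1/2$, and `h_bayes(-2.0)` returns `-1`.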

Fact:

$L_{\text{Bayes}} = L(h_{\text{Bayes}}) = \underset{h}{\min}L(h)$

Proof:

Let $\mathbb{1}(x=y)$ be the indicator function, where

$$ \begin{align*} \mathbb{1}(x=y) = \begin{cases} 1 & \text{if } x=y \\ 0 & \text{if } x \neq y \end{cases} \end{align*} $$

Then,

$$ \begin{align*} L(h) &= Pr_{x,y \thicksim D}[h(x) \neq y] \\ &= \mathbb{E}_{x,y \thicksim D}[\mathbb{1}(h(x) \neq y)] \\ &= \mathbb{E}_{x}\mathbb{E}_{y}[\mathbb{1}(h(x) \neq y) \mid x] \\ &= \mathbb{E}_{x}[Pr[y=1 \mid x]\mathbb{1}(h(x) \neq 1) + Pr[y=-1 \mid x]\mathbb{1}(h(x) \neq -1)] && \text{Law of Total Expectation / Tower Rule} \\ &= \mathbb{E}_{x}[\eta(x)\mathbb{1}(h(x) = -1) + (1-\eta(x))\mathbb{1}(h(x) = 1)]. \end{align*} $$

To minimize the expression, we choose

$$ h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1-\eta(x)\\ -1 & \text{if } \eta(x) < 1-\eta(x) \end{cases} \quad i.e.\quad h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2}\\ -1 & \text{if } \eta(x) < \frac{1}{2} \end{cases} = h_{\text{Bayes}}(x). $$

In practice, we cannot use $h_{\text{Bayes}}$ because we typically don't know $\eta(x)$ or $D$. It might be computationally infeasible to compute $h_{\text{Bayes}}$ even if we knew $D$.
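The pointwise minimization in the proof can be checked numerically on a toy problem. The sketch below uses a hypothetical discrete distribution on four inputs (the marginals `p` and conditionals `eta` are made-up illustrative values), computes $L(h)$ exactly via the last line of the derivation, and exhaustively verifies that no labeling of the four points beats the Bayes classifier:

```python
import itertools
import numpy as np

# Toy discrete setup (illustrative values): four inputs with marginal
# probabilities p(x) and conditionals eta(x) = Pr[y = 1 | x].
p = np.array([0.1, 0.2, 0.3, 0.4])
eta = np.array([0.9, 0.4, 0.6, 0.2])

def risk(h):
    # L(h) = E_x[ eta(x) 1(h(x) = -1) + (1 - eta(x)) 1(h(x) = 1) ]
    return np.sum(p * np.where(h == -1, eta, 1 - eta))

h_bayes = np.where(eta >= 0.5, 1, -1)

# Exhaustively check all 2^4 labelings of the four points.
best = min(risk(np.array(h)) for h in itertools.product([-1, 1], repeat=4))
assert np.isclose(risk(h_bayes), best)
```

Each point contributes $\min(\eta(x), 1-\eta(x))$ to the minimal risk, which is exactly what $h_{\text{Bayes}}$ achieves pointwise.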

Setup:

Let's now suppose we have a training set $S=\lbrace (x_{i},y_{i})\rbrace_{i=1}^{m} \sim D^{m}$.

Definition:

The empirical error (or training error) of $h$ on $S$ is $$\hat{L}_ {S}(h) = \frac{1}{m} \sum_ {i=1} ^{m} \mathbb{1}(h(x_ {i}) \neq y_ {i}).$$

Note that $L(h) = \mathbb{E}_ {S \thicksim D^m} \hat{L}_{S}(h)$.
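This unbiasedness can be sanity-checked by simulation. The sketch below reuses a hypothetical four-point distribution (the values of `p`, `eta`, and the fixed hypothesis `h` are illustrative assumptions), draws many training sets $S \sim D^m$, and compares the average of $\hat{L}_S(h)$ to the true risk $L(h)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])    # marginal over 4 inputs (illustrative)
eta = np.array([0.9, 0.4, 0.6, 0.2])  # Pr[y = 1 | x]
h = np.array([1, 1, -1, -1])          # an arbitrary fixed hypothesis

def empirical_error(m):
    # Draw S ~ D^m and return \hat{L}_S(h).
    xs = rng.choice(4, size=m, p=p)
    ys = np.where(rng.random(m) < eta[xs], 1, -1)
    return np.mean(h[xs] != ys)

# True risk L(h), for comparison.
L = np.sum(p * np.where(h == -1, eta, 1 - eta))

# Averaging \hat{L}_S(h) over many independent samples S approximates L(h).
avg = np.mean([empirical_error(100) for _ in range(2000)])
```

With the seed above, `avg` lands within about $0.01$ of `L`, consistent with $L(h) = \mathbb{E}_{S \thicksim D^m} \hat{L}_S(h)$.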

Ultimately, we want to bound $L(h) - L_{\text{Bayes}}$.

No Free Lunch Theorem & Implications

No Free Lunch Theorem: For any algorithm $A$ mapping from samples to hypotheses, any $m\in \mathbb{N}$, and any $\epsilon > 0$, there exists a distribution $D$ such that $L_{\text{Bayes}} = 0$ and $\mathbb{E} _{S \thicksim D^m} L(A(S)) \geq \frac{1}{2} - \epsilon$.

The implication of this theorem is that there is no universal learning algorithm: further assumptions or constraints are required to obtain meaningful generalization guarantees. The proof of a slightly different formulation may be found in Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David, Theorem 5.1, page 61. The stated formulation can be found in A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi, and Gábor Lugosi.

Let $\mathcal{H}$ be a hypothesis class containing $h : \chi \to \lbrace -1, 1 \rbrace$. For example, $\mathcal{H}$ might be the set of all neural networks, linear classifiers, or polynomials. Let $h^* = \underset{h \in \mathcal{H}}{\arg \min} L(h)$. Then,

$$ \begin{align*} L(h) - L_{\text{Bayes}} = (\underbrace{L(h) - L(h^{*})}_{\textrm{estimation error}}) + (\underbrace{L(h^{*}) - L_{\text{Bayes}}}_{\textrm{approximation error}}) \end{align*} $$

where the estimation error measures how close we are to the best $h^{*} \in \mathcal{H}$ and the approximation error measures how good the class $\mathcal{H}$ is. This decomposition is illustrated below (green and orange denote estimation and approximation errors, respectively).

*(Figure: decomposition of $L(h) - L_{\text{Bayes}}$ into estimation error (green) and approximation error (orange).)*

Increasing the size of $\mathcal{H}$ increases the estimation error and decreases the approximation error.

Bias-Variance Tradeoff

We want to pick the $\mathcal{H}$ that minimizes the total error. The relationship between estimation, approximation, and total errors is illustrated below. For intuition, see Occam's razor and the bias-variance tradeoff.

*(Figure: estimation, approximation, and total error as a function of the size of $\mathcal{H}$.)*
