CS7545_Sp23_Lecture_06: Intro to Statistical Learning Theory - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler Labonte

Notes for Lecture 06

January 26, 2023

Scribes: Conner Yurkon, Jules Morata

Statistical Learning Theory

Definition:

Generalization is the ability of a model to perform well on unseen data drawn from the training distribution.

Goal:

We want to study the link between training error and test error.

Why should we understand generalization?

  • To guarantee model performance
  • To guide practical research
  • To better understand training dynamics
  • To better understand distribution shift
  • To guide data collection

Setup:

Let $\chi$ be an input space, D a distribution on $\chi \times \lbrace -1,1 \rbrace$, and $h : \chi \rightarrow \lbrace -1,1\rbrace$ a hypothesis.

Definition:

The generalization error/risk of h is $L(h) = Pr_{x,y\thicksim D}[h(x) \neq y]$.

We want to find $h^{*} = \underset{h}{\arg \min} L(h)$.

Definition:

Let $\eta(x) = Pr[y=1|x]$. The Bayes classifier is

$$ h_{\text{Bayes}}(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2}\\ -1 & \text{if } \eta(x) < \frac{1}{2} \end{cases} $$
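As a concrete sketch, the Bayes classifier is easy to write down once $\eta$ is known. Here is a minimal Python version, assuming a hypothetical $\eta(x)$ (a logistic function of a scalar input, chosen purely for illustration; in practice $\eta$ is unknown):

```python
import numpy as np

def eta(x):
    # Hypothetical conditional probability Pr[y = 1 | x]; a logistic
    # function of a scalar input, chosen purely for illustration.
    return 1.0 / (1.0 + np.exp(-x))

def h_bayes(x):
    # Predict 1 when eta(x) >= 1/2, else -1 (ties broken toward 1,
    # matching the definition above).
    return np.where(eta(x) >= 0.5, 1, -1)
```

For example, `h_bayes(2.0)` returns `1` since $\eta(2) > 1/2$, and `h_bayes(-2.0)` returns `-1`.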

Fact:

$L_{\text{Bayes}} = L(h_{\text{Bayes}}) = \underset{h}{\min}L(h)$

Proof:

Let $\mathbb{1}(x=y)$ be the indicator function, where

$$ \begin{align*} \mathbb{1}(x=y) = \begin{cases} 1 & \text{if } x=y \\ 0 & \text{if } x \neq y \end{cases} \end{align*} $$

Then,

$$ \begin{align*} L(h) &= Pr_{x,y \thicksim D}[h(x) \neq y] \\ &= \mathbb{E}_{x,y \thicksim D}[\mathbb{1}(h(x) \neq y)] \\ &= \mathbb{E}_{x}\mathbb{E}_{y}[\mathbb{1}(h(x) \neq y) \mid x] \\ &= \mathbb{E}_{x}[Pr[y=1 \mid x]\mathbb{1}(h(x) \neq 1) + Pr[y=-1 \mid x]\mathbb{1}(h(x) \neq -1)] && \text{Law of Total Expectation / Tower Rule} \\ &= \mathbb{E}_{x}[\eta(x)\mathbb{1}(h(x) = -1) + (1-\eta(x))\mathbb{1}(h(x) = 1)]. \end{align*} $$

To minimize the expression, we choose

$$ h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1-\eta(x)\\ -1 & \text{if } \eta(x) < 1-\eta(x) \end{cases} \quad i.e.\quad h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2}\\ -1 & \text{if } \eta(x) < \frac{1}{2} \end{cases} = h_{\text{Bayes}}(x). $$

In practice, we cannot use $h_{\text{Bayes}}$ because we typically don't know $\eta(x)$ or $D$. It might be computationally infeasible to compute $h_{\text{Bayes}}$ even if we knew $D$.
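The pointwise minimization in the proof can be checked numerically on a toy problem. The sketch below uses a hypothetical discrete distribution on four inputs (the marginals `p` and conditionals `eta` are made-up illustrative values), computes $L(h)$ exactly via the last line of the derivation, and exhaustively verifies that no labeling of the four points beats the Bayes classifier:

```python
import itertools
import numpy as np

# Toy discrete setup (illustrative values): four inputs with marginal
# probabilities p(x) and conditionals eta(x) = Pr[y = 1 | x].
p = np.array([0.1, 0.2, 0.3, 0.4])
eta = np.array([0.9, 0.4, 0.6, 0.2])

def risk(h):
    # L(h) = E_x[ eta(x) 1(h(x) = -1) + (1 - eta(x)) 1(h(x) = 1) ]
    return np.sum(p * np.where(h == -1, eta, 1 - eta))

h_bayes = np.where(eta >= 0.5, 1, -1)

# Exhaustively check all 2^4 labelings of the four points.
best = min(risk(np.array(h)) for h in itertools.product([-1, 1], repeat=4))
assert np.isclose(risk(h_bayes), best)
```

Each point contributes $\min(\eta(x), 1-\eta(x))$ to the minimal risk, which is exactly what $h_{\text{Bayes}}$ achieves pointwise.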

Setup:

Let's now suppose we have a training set $S=\lbrace (x_{i},y_{i})\rbrace_{i=1}^{m} \sim D^{m}$.

Definition:

The empirical error (or training error) of $h$ on $S$ is $$\hat{L}_ {S}(h) = \frac{1}{m} \sum_ {i=1} ^{m} \mathbb{1}(h(x_ {i}) \neq y_ {i}).$$

Note that $L(h) = \mathbb{E}_ {S \thicksim D^m} \hat{L}_{S}(h)$.
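This unbiasedness can be sanity-checked by simulation. The sketch below reuses a hypothetical four-point distribution (the values of `p`, `eta`, and the fixed hypothesis `h` are illustrative assumptions), draws many training sets $S \sim D^m$, and compares the average of $\hat{L}_S(h)$ to the true risk $L(h)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])    # marginal over 4 inputs (illustrative)
eta = np.array([0.9, 0.4, 0.6, 0.2])  # Pr[y = 1 | x]
h = np.array([1, 1, -1, -1])          # an arbitrary fixed hypothesis

def empirical_error(m):
    # Draw S ~ D^m and return \hat{L}_S(h).
    xs = rng.choice(4, size=m, p=p)
    ys = np.where(rng.random(m) < eta[xs], 1, -1)
    return np.mean(h[xs] != ys)

# True risk L(h), for comparison.
L = np.sum(p * np.where(h == -1, eta, 1 - eta))

# Averaging \hat{L}_S(h) over many independent samples S approximates L(h).
avg = np.mean([empirical_error(100) for _ in range(2000)])
```

With the seed above, `avg` lands within about $0.01$ of `L`, consistent with $L(h) = \mathbb{E}_{S \thicksim D^m} \hat{L}_S(h)$.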

Ultimately, we want to bound $L(h) - L_{\text{Bayes}}$.

No Free Lunch Theorem & Implications

No Free Lunch Theorem: For any algorithm $A$ mapping from samples to hypotheses, any $m\in \mathbb{N}$, and any $\epsilon > 0$, there exists a distribution $D$ such that $L_{\text{Bayes}} = 0$ and $\mathbb{E} _{S \thicksim D^m} L(A(S)) \geq \frac{1}{2} - \epsilon$.

The implication of this theorem is that there is no universal learning algorithm: further assumptions or constraints are required to obtain meaningful generalization guarantees. The proof of a slightly different formulation may be found in Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David, Theorem 5.1, page 61. The stated formulation can be found in A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi, and Gábor Lugosi.

Let $\mathcal{H}$ be a hypothesis class containing $h : \chi \to \lbrace -1, 1 \rbrace$. For example, $\mathcal{H}$ might be the set of all neural networks, linear classifiers, or polynomials. Let $h^* = \underset{h \in \mathcal{H}}{\arg \min} L(h)$. Then,

$$ \begin{align*} L(h) - L_{\text{Bayes}} = (\underbrace{L(h) - L(h^{*})}_{\textrm{estimation error}}) + (\underbrace{L(h^{*}) - L_{\text{Bayes}}}_{\textrm{approximation error}}) \end{align*} $$

where the estimation error measures how close we are to the best $h^{*} \in \mathcal{H}$ and the approximation error measures how good the class $\mathcal{H}$ is. This decomposition is illustrated below (green and orange denote estimation and approximation errors, respectively).

*(Figure: decomposition of $L(h) - L_{\text{Bayes}}$ into estimation error (green) and approximation error (orange).)*

Increasing the size of $\mathcal{H}$ increases the estimation error and decreases the approximation error.

Bias-Variance Tradeoff

We want to pick the $\mathcal{H}$ that minimizes the total error. The relationship between estimation, approximation, and total errors is illustrated below. For intuition, see Occam's razor and the bias-variance tradeoff.

*(Figure: estimation, approximation, and total error as a function of the size of $\mathcal{H}$.)*
