CS7545_Sp23_Lecture_14: Sauer Shelah Lemma - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler LaBonte

Notes for Lecture 14

February 23, 2023

Scribes: Chunhao Zou, Zheyi Zhang

This is the final lecture on VC-Dimension, where we will discuss the properties of the growth function, specifically how fast it grows. Furthermore, we want to integrate the properties of the growth function with Rademacher Complexity to derive a generalization error bound for a hypothesis class.

Recap

The following concepts are used in the lecture.

Growth Function: Recall that the growth function of a hypothesis class $\mathcal H$ is defined as $\Pi_{\mathcal H}(m) = \max_{S \subseteq \mathcal X, |S| = m} |\mathcal H|_S|$.

We saw earlier that for binary-valued hypotheses, the growth function is bounded by $\Pi_{\mathcal H}(m) \leq 2^m$.
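To make the growth function concrete, here is a minimal sketch (not from the lecture) that enumerates $\mathcal H|_S$ for the class of 1-D threshold classifiers $h_t(x) = \operatorname{sign}(x - t)$ on a random sample and compares $|\mathcal H|_S|$ to the generic $2^m$ bound; the class, the sample, and the helper name are illustrative assumptions.

```python
# Minimal sketch (not from the lecture): count the distinct labelings that
# 1-D threshold classifiers h_t(x) = sign(x - t) induce on a random sample S,
# i.e. |H|_S|, and compare against the generic 2^m bound.
import random

def restrictions(points, thresholds):
    """Return the set of labelings {(h_t(x_1), ..., h_t(x_m)) : t in thresholds}."""
    labelings = set()
    for t in thresholds:
        labelings.add(tuple(1 if x > t else -1 for x in points))
    return labelings

m = 8
random.seed(0)
S = sorted(random.uniform(0, 1) for _ in range(m))
# Thresholds below, between, and above the sample points realize every labeling.
candidate_ts = [S[0] - 1] + [(a + b) / 2 for a, b in zip(S, S[1:])] + [S[-1] + 1]

print("|H|_S| =", len(restrictions(S, candidate_ts)))  # m + 1 distinct labelings
print("2^m    =", 2 ** m)                              # generic upper bound
```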

However, we still haven't analyzed the behavior of our growth function: how fast does it grow? What can we know about the growth function of a hypothesis class if we know its VC-Dimension, and vice versa?

Binomial Theorem: $(a+b)^m = \sum_{i=0}^{m}\binom{m}{i}a^{m-i}b^i$

Special case: $(1+x)^m = \sum_{i=0}^{m}\binom{m}{i}x^i$.

Massart's Lemma: If $A \subset \mathbb{R}^m$ with $|A|<\infty$, then $\mathfrak{R}(A)\leq\max _{a \in A}||a||_2\frac{\sqrt{2 \log |A|}}{m}$.
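As a quick numerical illustration of Massart's Lemma (my own sketch, not from the lecture), the code below estimates $\mathfrak{R}(A)$ by Monte Carlo for an arbitrary finite set $A \subset \mathbb{R}^m$ of sign vectors and compares it to the bound; the choice of $A$, $m$, and the trial count are assumptions.

```python
# Minimal sketch: Monte Carlo estimate of R(A) = (1/m) E_sigma max_{a in A} <sigma, a>
# for a small finite set A in R^m, compared against Massart's bound
# max_{a in A} ||a||_2 * sqrt(2 log |A|) / m.  The set A here is an arbitrary choice.
import math
import random

random.seed(0)
m, n_vectors, n_trials = 20, 8, 20000
A = [[random.choice([-1.0, 1.0]) for _ in range(m)] for _ in range(n_vectors)]

total = 0.0
for _ in range(n_trials):
    sigma = [random.choice([-1, 1]) for _ in range(m)]
    total += max(sum(s * a_i for s, a_i in zip(sigma, a)) for a in A) / m
estimate = total / n_trials

max_norm = max(math.sqrt(sum(a_i ** 2 for a_i in a)) for a in A)
bound = max_norm * math.sqrt(2 * math.log(len(A))) / m

print(f"Monte Carlo estimate of R(A): {estimate:.3f}")
print(f"Massart's bound:              {bound:.3f}")
```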

Sauer-Shelah Lemma

The Sauer-Shelah Lemma (proved independently by Sauer and Shelah in the early 1970s) is the following:

Suppose $VC_{dim}(\mathcal H)=d<\infty$. Then for all $m \in \mathbb N$,

$$\Pi_\mathcal H(m) \leq \sum_{i=0}^d \binom{m}{i}$$

The proof for the lemma is in the textbook (not discussed in class).

Why is Sauer-Shelah Lemma important? What are the implications of the lemma?

If we know the VC-Dimension of a hypothesis class, the lemma gives us a much better bound on its growth function than $\Pi_\mathcal H(m) \leq 2^m$.

Corollary: For $m\geq d$,

$$\Pi_\mathcal H(m) \leq \left(\frac{em}{d}\right)^d=O(m^d)$$

Interpretation of Corollary:

The growth function has two behaviors.

  1. If $VC_{dim}(\mathcal H)=\infty$, then $\Pi_\mathcal H(m) = 2^m$.
  2. If $VC_{dim}(\mathcal H)=d < \infty$, then $\Pi_\mathcal H(m) = 2^m$ for $m\le d$ and $\Pi_\mathcal H(m) = O(m^d)$ for $m\geq d$.

Previously, from the definitions of the growth function and VC-Dimension, we only knew that $\Pi_\mathcal H(m) = 2^m$ for $m \le d$. The Sauer-Shelah Lemma, however, tells us about the behavior of the growth function for values of $m$ greater than $d$! Below, we prove that the growth function grows polynomially (more precisely, is bounded by a polynomial) when $m\geq d$.

Another interpretation of the corollary: a hypothesis class with finite VC-Dimension can realize only polynomially many distinct labelings of $m$ samples, while a class with infinite VC-Dimension can realize exponentially many ($2^m$). Roughly speaking, only hypothesis classes with finite VC-Dimension can be learned from a polynomial number of samples; if a hypothesis class has infinite VC-Dimension, no polynomial amount of data suffices to train it.

Proof of Corollary:

Assume $m \geq d$,

$$ \begin{align*} \Pi_\mathcal H(m) &\leq \sum_{i=0}^d \binom{m}{i} &&\text{(Sauer-Shelah Lemma)}\\ &\leq \sum_{i=0}^d \binom{m}{i}\left(\frac{m}{d}\right)^{d-i} &&\text{(since $m\geq d$ and $i\le d$, $\left(\frac{m}{d}\right)^{d-i}\ge 1$)}\\ &\leq \sum_{i=0}^m \binom{m}{i}\left(\frac{m}{d}\right)^{d-i} &&\text{(extend the sum from $d$ to $m$; the extra terms are nonnegative)}\\ &=\left(\frac{m}{d}\right)^{d}\sum_{i=0}^m \binom{m}{i}\left(\frac{d}{m}\right)^{i} &&\text{(factor out $\left(\frac{m}{d}\right)^d$ to match the Binomial Theorem)}\\ &=\left(\frac{m}{d}\right)^{d}\left(1+\frac{d}{m}\right)^m &&\text{(Binomial Theorem, see Recap)}\\ &\leq\left(\frac{m}{d}\right)^{d}\left(e^{\frac{d}{m}}\right)^m=\left(\frac{em}{d}\right)^{d}=O(m^d) &&\text{(since $1+x\le e^x$)} \end{align*} $$

Q.E.D.
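As a quick numerical check of the corollary (my own sketch, not from class), the code below compares $\sum_{i=0}^d \binom{m}{i}$, $\left(\frac{em}{d}\right)^d$, and $2^m$ for a fixed $d$ and increasing $m$; the specific values of $d$ and $m$ are arbitrary.

```python
# Minimal sketch: compare the Sauer-Shelah bound sum_{i<=d} C(m, i), the corollary
# bound (em/d)^d, and the trivial bound 2^m for a fixed VC-dimension d (arbitrary values).
import math

d = 3
for m in [3, 5, 10, 20, 50]:
    sauer = sum(math.comb(m, i) for i in range(d + 1))
    corollary = (math.e * m / d) ** d
    print(f"m={m:3d}  sum_i C(m,i)={sauer:8d}  (em/d)^d={corollary:12.1f}  2^m={2 ** m}")
```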

Relationship Between Rademacher Complexity and VC-Dimension

Proposition: $\mathfrak{R}(\mathcal{H}) \le \sqrt{\frac{2\log\Pi_\mathcal{H}(m)}{m}}$

Proof: The proof uses Massart's Lemma:

$$ \mathfrak{R}(A) = \frac{1}{m}\mathbb{E}_\sigma\sup _{a\in A}\sum _{i=1}^m \sigma_i a_i \le \max _{a\in A} ||a||_2 \frac{\sqrt{2\log|A|}}{m} $$

Recall $\mathcal{H}|_S = \{(h(x_1), h(x_2), \dots, h(x_m)) : h\in \mathcal{H}\}$

We have

$$\max _{u\in \mathcal{H}|_S} ||u||_2 = \max _{h\in \mathcal{H}} || (h(x_1), h(x_2), \dots, h(x_m))||_2 = \sqrt{m}$$

Note that $h(x_i)$ is always either $+1$ or $-1$, so every element of $\mathcal{H}|_S$ has $\ell_2$-norm exactly $\sqrt{m}$; in other words, every element of this set attains this maximum.

$$ \begin{align*} \mathfrak{R}(\mathcal{H}) &= \mathbb{E} _{S\sim D^m} \mathbb{E} _\sigma \left[ \sup _{h\in\mathcal{H} }\frac{1}{m}\sum _{i=1}^m \sigma_i h(x_i) \right] \\ &= \mathbb{E} _{S\sim D^m} \mathbb{E} _\sigma \left[ \sup _{u\in\mathcal{H}|_S }\frac{1}{m}\sum _{i=1}^m \sigma_i u_i \right] \\ &\le \mathbb{E} _{S\sim D^m} \sqrt{\frac{2\log|\mathcal{H}|_S|}{m}} && \text{using Massart's Lemma} \\ &\le \sqrt{\frac{2\log\Pi _\mathcal{H}(m)}{m}} && \forall S, |\mathcal{H}|_S| \le \Pi _\mathcal{H}(m) \end{align*} $$
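To see this bound in action, here is an illustrative sketch (not from the lecture) that estimates the empirical Rademacher complexity of the 1-D threshold class on a fixed sample and compares it to $\sqrt{2\log\Pi_\mathcal{H}(m)/m}$, using the fact that $\Pi_\mathcal{H}(m) = m + 1$ for that class; the choice of class and sample are my own assumptions.

```python
# Minimal sketch: Monte Carlo estimate of the empirical Rademacher complexity of
# 1-D threshold classifiers h_t(x) = sign(x - t) on a fixed sample S, compared
# with sqrt(2 log Pi_H(m) / m), where Pi_H(m) = m + 1 for this class.
import math
import random

random.seed(0)
m, n_trials = 30, 20000
S = sorted(random.uniform(0, 1) for _ in range(m))
thresholds = [S[0] - 1] + [(a + b) / 2 for a, b in zip(S, S[1:])] + [S[-1] + 1]
# H|_S: all labelings of S realized by some threshold.
labelings = {tuple(1 if x > t else -1 for x in S) for t in thresholds}

total = 0.0
for _ in range(n_trials):
    sigma = [random.choice([-1, 1]) for _ in range(m)]
    total += max(sum(s * u_i for s, u_i in zip(sigma, u)) for u in labelings) / m
estimate = total / n_trials

bound = math.sqrt(2 * math.log(m + 1) / m)
print(f"empirical Rademacher complexity ~= {estimate:.3f}")
print(f"sqrt(2 log Pi_H(m) / m)          = {bound:.3f}")
```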

Derivation of a Generalization Error Upper Bound w.r.t. VC-Dimension

Theorem: Let $\mathcal{H}\subset \{h: \mathcal{X}\to \{-1,+1\}\}$ be a class of hypotheses, with VC-Dimension $VC_{dim}(\mathcal{H}) = d$, then $\forall \delta >0$, with probability $\ge 1-\delta$ over $S\sim D^m, \forall h\in \mathcal{H}$, we have $$L(h)\le \hat{L}_S(h)+\sqrt{\frac{2d\log\frac{em}{d}}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}$$

Proof:

With probability $\ge 1-\delta$,

$$ \begin{align*} L(h) &\le \hat{L} _S(h) + \mathfrak{R}(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}} &&\text{(Theorem 2 from Lecture 10)}\\ &\le \hat{L} _S(h) + \sqrt{\frac{2\log\Pi _\mathcal{H}(m)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}} &&\text{(using the Proposition above)}\\ &\le \hat{L} _S(h) + \sqrt{\frac{2\log\left(\frac{em}{d}\right)^d}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}} && \text{(using the Corollary of the Sauer-Shelah Lemma)}\\ &= \hat{L} _S(h) + \sqrt{\frac{2d\log\frac{em}{d}}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}} \end{align*} $$
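To get a feel for the size of this bound, the sketch below (my own illustration, with arbitrary choices of $d$, $m$, and $\delta$) evaluates the VC term and the confidence term.

```python
# Minimal sketch: evaluate the two complexity terms of the VC generalization bound,
# sqrt(2 d log(em/d) / m) and sqrt(log(1/delta) / (2m)), for arbitrary (d, m, delta).
import math

delta = 0.05
for d in [5, 50]:
    for m in [100, 1000, 10000, 100000]:
        if m < d:          # the corollary (and hence the bound) assumes m >= d
            continue
        vc_term = math.sqrt(2 * d * math.log(math.e * m / d) / m)
        conf_term = math.sqrt(math.log(1 / delta) / (2 * m))
        print(f"d={d:2d}  m={m:6d}  VC term={vc_term:.3f}  confidence term={conf_term:.3f}")
```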

Fact: In this lecture we demonstrated a one-sided upper bound on the generalization error. However, there is an even tighter bound in terms of $d$, the VC-Dimension of the hypothesis class, which makes the error estimate more precise and convenient. (We won't present the proof in this lecture; those who want to dig deeper are encouraged to explore it on their own.)

This tighter bound is:

$$L(h)\leq \hat L_S (h)+O\left(\sqrt{\frac{d}{m}}\right)$$

Interpretation of Fact: when $d\ge m$, this bound is vacuous (useless), since we already know $L(h)\le 1$; when $d < m$, that is, when the VC-Dimension of the hypothesis class is less than the number of data points we have, the bound is non-vacuous (useful). Therefore, the higher the VC-Dimension, the more data is needed. In the special case $d=\infty$, we have $d>m$ for every $m$, so the bound is vacuous no matter how much data we have; in other words, no amount of data rules out overfitting.

Implications on ML Model Selection: Given a fixed dataset (and thus a fixed $m$), we want to choose a hypothesis class $\mathcal{H}$ with $VC_{dim}(\mathcal{H}) \le m$. Moreover, we want to pick a hypothesis $h\in\mathcal{H}$ that minimizes $\hat{L}_S(h) + O\left(\sqrt{\frac{d}{m}}\right)$. In other words, we want a class of models that is simple enough yet still able to give a decently small empirical error. This relates to the Bias-Variance Trade-off we discussed in lecture 6.


Rademacher Complexity and Regularization

Scribe's Note: This section exists because a student asked a question about regularization in class. The content here would fit more naturally in lectures 10 and 11, where we proved that $L(h) \le \hat{L}_S(h) + \mathfrak{R}(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}$. We see that the true generalization risk of a hypothesis, i.e., its real-world performance, is bounded by a combination of the empirical error, the Rademacher Complexity of the class, and a confidence term.

Suppose we have $m$ samples with $x_i \in \mathbb{R}^n$ and $||x_i||_2 \le C$ for all $1 \le i \le m$. In lecture 8, we showed that $\hat{R}_S(\mathcal{F}) \le \frac{BC}{\sqrt{m}}=O(B)$ for $\mathcal{F} = \{x \mapsto \langle w, x \rangle : ||w||_2 \le B\}$.
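Here is a small numerical sketch of that bound (my own illustration, with arbitrary $B$, $C$, dimension, and data). By Cauchy-Schwarz, the supremum over $\{w : ||w||_2 \le B\}$ is attained at $w = B\frac{\sum_i \sigma_i x_i}{||\sum_i \sigma_i x_i||_2}$, so the empirical Rademacher complexity equals $\frac{B}{m}\mathbb{E}_\sigma\left\|\sum_i \sigma_i x_i\right\|_2$, which the code estimates by Monte Carlo and compares to $\frac{BC}{\sqrt{m}}$.

```python
# Minimal sketch: empirical Rademacher complexity of F = {x -> <w, x> : ||w||_2 <= B}.
# The sup over w equals (B/m) * ||sum_i sigma_i x_i||_2, which we estimate by Monte
# Carlo and compare with B*C/sqrt(m).  B, C, n, m, and the data are arbitrary choices.
import math
import random

random.seed(0)
B, C, n, m, n_trials = 2.0, 1.0, 5, 200, 5000

def random_point():
    # A point drawn uniformly from the ball of radius C in R^n.
    v = [random.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    radius = C * random.random() ** (1 / n)
    return [radius * c / norm for c in v]

X = [random_point() for _ in range(m)]

total = 0.0
for _ in range(n_trials):
    sigma = [random.choice([-1, 1]) for _ in range(m)]
    s = [sum(sig * x[j] for sig, x in zip(sigma, X)) for j in range(n)]
    total += B * math.sqrt(sum(c * c for c in s)) / m
estimate = total / n_trials

print(f"empirical Rademacher complexity ~= {estimate:.4f}")
print(f"B*C/sqrt(m)                      = {B * C / math.sqrt(m):.4f}")
```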

Here we clearly see the theoretical foundation of regularization: restricting the norm of the weight vector gives us a "simpler" hypothesis class with smaller Rademacher Complexity, even if it might hurt $\hat{L}_S(h)$. Again, this relates to the Bias-Variance Trade-off we discussed in lecture 6.

This is also why we often choose the training objective to be a combination of the empirical error and the norm of the weight vector: rather than minimizing $\hat L_S(h)$ alone, we are actually minimizing an upper bound on $L(h)$!
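As a hypothetical illustration of this point (not something presented in class), the sketch below writes a regularized training objective as empirical loss plus a penalty on $||w||_2^2$; the choice of hinge loss, the penalty weight, and the data are all assumptions.

```python
# Minimal sketch: a regularized training objective = empirical loss + lambda * ||w||_2^2.
# Penalizing ||w|| shrinks the effective hypothesis class, lowering its Rademacher
# Complexity at the possible cost of a larger empirical loss.  The hinge loss,
# lambda, and the data below are illustrative choices, not from the lecture.
import random

def hinge_loss(w, b, x, y):
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

def regularized_objective(w, b, data, lam):
    empirical = sum(hinge_loss(w, b, x, y) for x, y in data) / len(data)
    penalty = lam * sum(wi * wi for wi in w)
    return empirical + penalty

random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], random.choice([-1, 1]))
        for _ in range(50)]
w, b, lam = [0.5, -0.3], 0.1, 0.01
print(f"regularized objective = {regularized_objective(w, b, data, lam):.3f}")
```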
