CS7545_Sp24_Lecture_11 - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Tyler LaBonte

Notes for Lecture 11

February 20, 2024

Scribes: Alex Hobby, Matthew Spillman

Aside: PAC Learning

Question. What is the relationship between PAC Learning and what we've been looking at so far?

Answer. There are two main approaches to learning theory:

  1. Statistical Learning Theory (SLT): Given a fixed model class and a dataset of size $m$, what generalization performance $\epsilon$ does our model achieve?

  2. Computational Learning Theory (CLT): How much data $m$ is needed to achieve a fixed performance $\epsilon$ using our model?

For this class, we focus on statistical learning theory, and therefore on developing generalization bounds. These are equivalent to the "sample complexity bounds" of CLT; the only difference is that a sample complexity bound solves the same inequality for $m$ instead of $\epsilon$.

For example, recall for uniform Hoeffding's, we at one point had:

$$|\mathcal{H}|\exp (-2m\epsilon^2) = \delta$$

In SLT, we solved for $\epsilon$. In CLT, we solve for $m$.
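For concreteness, solving the same equation both ways gives:

$$\begin{align*} |\mathcal{H}|\exp(-2m\epsilon^2) = \delta \quad &\Longrightarrow \quad \epsilon = \sqrt{\frac{\log (|\mathcal{H}| / \delta)}{2m}} \qquad \text{(SLT: generalization bound)} \\\ &\Longrightarrow \quad m = \frac{\log (|\mathcal{H}| / \delta)}{2\epsilon^2} \qquad \text{(CLT: sample complexity bound)} \end{align*}$$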

PAC (Probably Approximately Correct) Learning is a CLT framework that focuses on answering: how much data do we need to guarantee model performance $\epsilon$ with probability $1 - \delta$? This takes a more computer-science approach to bounds.

Recall from end of Lecture 10

McDiarmid's Inequality

Suppose that we have a function $g: \mathcal{X}^m \rightarrow \mathbb{R}$, and $\exists c_1, c_2, ..., c_m > 0$ such that $\forall$ samples $S$ and $S'$ which differ only in the $i$'th element:

$$|g(S) - g(S')| \leq c_i \qquad (\Delta)$$

Then, $\forall \epsilon > 0$,

$$\underset{S \sim D^m}{Pr} \left[g(s) - \underset{S \sim D^m}{\mathbb{E}}[g(s)] \geq \epsilon \right] \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{m}c_i^2}\right)$$

$(\Delta)$ is called the "Bounded Differences Condition."
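As a quick sanity check of the condition, the empirical mean of $[0, 1]$-valued samples satisfies $(\Delta)$ with $c_i = \frac{1}{m}$, and McDiarmid's then recovers (one side of) Hoeffding's inequality:

$$g(S) = \frac{1}{m}\sum_{i=1}^m x_i, \quad x_i \in [0, 1] \quad \Longrightarrow \quad |g(S) - g(S')| = \frac{|x_i - x_i'|}{m} \leq \frac{1}{m}, \quad \sum_{i=1}^m c_i^2 = \frac{1}{m} \quad \Longrightarrow \quad \underset{S \sim D^m}{Pr}\left[g(S) - \underset{S \sim D^m}{\mathbb{E}}[g(S)] \geq \epsilon\right] \leq \exp(-2m\epsilon^2)$$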

We then set out to prove:

Theorem

Suppose $\mathcal{F} \subseteq \{ f: \mathcal{X} \rightarrow [0, 1]\}$. Then, $\forall \delta > 0$ with probability at least $1 - \delta$ over $S \sim D^m$, $\forall f \in \mathcal{F}$, we have:

$$L(f) \leq \hat{L_s}(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log \frac{1}{\delta} }{2m}} \qquad (1) $$

$$L(f) \leq \hat{L_s}(f) + 2\hat{\mathfrak{R}}_s(\mathcal{F}) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}} \qquad (2)$$

Resumed Proof

Previously, we showed that $g(s) = \sup_{f \in \mathcal{F}} L(f) - \hat{L_s}(f)$ satisfies the Bounded Differences Condition $(\Delta)$ with $c_i = \frac{1}{m}$, $\forall i$.

Plugging into McDiarmid's Inequality, and noting that $\sum_{i=1}^m c_i^2 = m \cdot \frac{1}{m^2} = \frac{1}{m}$:

$$\underset{S \sim D^m}{Pr} \bigg[ g(s) - \underset{S \sim D^m}{\mathbb{E}} [g(s)] \geq \epsilon\bigg] \leq \exp (-2m \epsilon^2)$$

Similar to our proof of uniform Hoeffding's, set the right-hand side of this equation equal to $\delta$ and solve for $\epsilon$.

$$\begin{align*} \delta &= \exp (-2m\epsilon^2) \\\ \log \delta &= -2m \epsilon^2 \\\ \epsilon&= \sqrt{\frac{\log (1 / \delta)}{2m}} \end{align*}$$

Plugging back in:

$$\underset{S \sim D^m}{Pr} \left[ g(s) - \underset{S \sim D^m}{\mathbb{E}} [g(s)] \geq \sqrt{\frac{\log (1 / \delta)}{2m}} \right] \leq \delta$$

The following is now true with probability $\geq 1 - \delta$: $$g(s) \leq \underset{S \sim D^m}{\mathbb{E}} [g(s)] + \sqrt{\frac{\log (1 / \delta)}{2m}} $$

Let's look at $\underset{S \sim D^m}{\mathbb{E}} [g(s)]$, and conclude via the Symmetrization Theorem:

$$\underset{S \sim D^m}{\mathbb{E}} [g(s)] = \underset{S \sim D^m}{\mathbb{E}} \bigg[\sup_{f \in \mathcal{F}} L(f) - \hat{L_s}(f)\bigg] \leq 2\mathfrak{R}(\mathcal{F})$$

We now have:

$$ \sup_{f \in \mathcal{F}} \bigg[L(f) - \hat{L_s}(f) \bigg] \leq 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log (1 / \delta)}{2m}} $$

And finally, since this holds simultaneously for all $f \in \mathcal{F}$, we conclude bound (1):

$$ L(f) \leq \hat{L_s}(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log (1 / \delta)}{2m}} $$
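The lecture proof stops at bound (1). As a brief sketch (following the standard argument, not covered in lecture), bound (2) follows by also applying McDiarmid's Inequality to $-\hat{\mathfrak{R}}_s(\mathcal{F})$, which satisfies $(\Delta)$ with $c_i = \frac{1}{m}$ since the functions take values in $[0, 1]$, and union bounding the two events at confidence $\delta / 2$ each:

$$\begin{align*} \mathfrak{R}(\mathcal{F}) = \underset{S \sim D^m}{\mathbb{E}}\big[\hat{\mathfrak{R}}_s(\mathcal{F})\big] &\leq \hat{\mathfrak{R}}_s(\mathcal{F}) + \sqrt{\frac{\log (2 / \delta)}{2m}} \qquad \text{with probability } \geq 1 - \tfrac{\delta}{2} \\\ \Longrightarrow \quad L(f) &\leq \hat{L_s}(f) + 2\hat{\mathfrak{R}}_s(\mathcal{F}) + 3\sqrt{\frac{\log (2 / \delta)}{2m}} \qquad \text{with probability } \geq 1 - \delta \end{align*}$$

where the last line combines the above with the proof of (1) run at confidence $\delta / 2$.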

Question. Can we get a generalization bound for binary classification for hypothesis classes $\mathcal{H} \subseteq \{ h : \mathcal{X} \rightarrow \{-1, 1\} \}$ when $|\mathcal{H}| = \infty$?

Corollary

Suppose $\mathcal{H} \subseteq \{ h : \mathcal{X} \rightarrow \{-1, 1\} \}$. Then, $\forall \delta > 0$, with probability $\geq 1 - \delta$ over $S \sim D^m$, we have $\forall h \in \mathcal{H}$:

$$L(h) \leq \hat{L_s}(h) + \mathfrak{R}(\mathcal{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \qquad (3)$$

$$L(h) \leq \hat{L_s}(h) + \hat{\mathfrak{R}}_s(\mathcal{H}) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}} \qquad (4) $$

Proof Sketch: Create the mapped set $\mathcal{F} = \{(x, y) \mapsto \mathbb{1}(h(x) \neq y) : h \in \mathcal{H}\}$, which we previously defined as the "loss class" with respect to the $0, 1$ loss. This lets us move from our hypothesis class, whose outputs lie in $\{-1, 1\}$, to a class whose outputs lie in $[0, 1]$, which is a requirement to apply the previous theorem.
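To see why the factor of $2$ from the theorem is absorbed in (3) and (4), note (a standard computation, filled in here for completeness) that for labels in $\{-1, 1\}$ the $0, 1$ loss satisfies $\mathbb{1}(h(x) \neq y) = \frac{1 - y h(x)}{2}$, so the loss class has exactly half the Rademacher complexity of $\mathcal{H}$:

$$\begin{align*} \hat{\mathfrak{R}}_s(\mathcal{F}) &= \mathbb{E}_\sigma \bigg[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i \cdot \frac{1 - y_i h(x_i)}{2}\bigg] \\\ &= \mathbb{E}_\sigma \bigg[\frac{1}{2m} \sum_{i=1}^m \sigma_i\bigg] + \frac{1}{2}\, \mathbb{E}_\sigma \bigg[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m (-\sigma_i y_i)\, h(x_i)\bigg] = \frac{1}{2} \hat{\mathfrak{R}}_s(\mathcal{H}), \end{align*}$$

since $-\sigma_i y_i$ has the same distribution as $\sigma_i$ and the first term has zero expectation. Taking expectation over $S$ gives $2\mathfrak{R}(\mathcal{F}) = \mathfrak{R}(\mathcal{H})$, so (3) and (4) follow from (1) and (2).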

Remark 1. This is what we wanted! A bound on binary classification!

Remark 2. This works for any bounded loss $\ell$: just set $\mathcal{F} = \{(x, y) \mapsto c \cdot \ell(h(x), y) : h \in \mathcal{H}\}$, where the constant $c > 0$ rescales the loss outputs into the $[0, 1]$ range. Importantly, the loss must be bounded so that such a $c$ exists.

Remark 3. History lesson: This technique of using Rademacher Complexity to form a generalization bound was pioneered by Vladimir Koltchinskii, a professor at Georgia Tech, as well as others including Peter Bartlett, Prof. Abernethy's advisor.

Remark 4. Bound (4) does not depend on $D$, only on $S$! As a result, we can in principle compute this bound from a drawn sample and obtain a numerical bound on the expected risk of our classifier on unseen test data. This is significant: it would give a certified stamp of approval that our model performs well in expectation on new data from the same distribution, which is important for deployed settings such as self-driving cars. Unfortunately, computing $\hat{\mathfrak{R}}_s(\mathcal{H})$ exactly is NP-hard in general, which motivates the development of a new complexity measure.
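To make Remark 4 concrete, here is a small illustrative sketch (not from lecture; the function and the toy class below are invented for illustration) that estimates the empirical Rademacher complexity by Monte Carlo over $\sigma$ when the class is given explicitly as a small finite set of labelings of $S$. The hardness in general comes from the inner supremum over a rich class, which this brute-force approach cannot handle.

```python
import numpy as np

def empirical_rademacher(label_matrix: np.ndarray, n_trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ].

    label_matrix has shape (num_hypotheses, m): each row is the label vector
    (h(x_1), ..., h(x_m)) in {-1, +1} of one hypothesis on the fixed sample S.
    """
    rng = np.random.default_rng(seed)
    _, m = label_matrix.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)    # Rademacher vector
        correlations = label_matrix @ sigma / m    # (1/m) sum_i sigma_i h(x_i) for each hypothesis
        total += correlations.max()                # supremum over the (small, explicit) class
    return total / n_trials

# Toy "class": three fixed labelings of an m = 5 point sample.
H_on_S = np.array([
    [ 1,  1,  1,  1,  1],
    [-1, -1, -1, -1, -1],
    [ 1, -1,  1, -1,  1],
])
print(empirical_rademacher(H_on_S))  # estimate of the empirical Rademacher complexity of this 3-labeling class
```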

VC-Dimension

VC-Dimension is another complexity measure, closely related to Rademacher complexity; it turns out to be the more specialized of the two, since (as we will see) it can be used to upper bound the Rademacher complexity of binary classes. Interestingly, VC-Dimension was studied first, then Rademacher complexity, and the field later returned to VC-Dimension after realizing it helps deal with issues like the computational one above.

Some of its important characteristics include:

  • Purely combinatorial (no randomness, expectations, etc.)
  • Distribution-free (data-independent)
  • Easy to compute for many interesting classes
  • Can be finite or infinite. Finiteness characterizes "learnability," which we will not formally define; it is outside the scope of this class since it is more related to CLT. Loosely, learnability means our model can reach $\epsilon$-accuracy with a polynomial number of samples.

Definition. The restriction of $\mathcal{H} \subseteq \{ h : \mathcal{X} \rightarrow \{-1, 1\} \}$ to a sample $S \subseteq \mathcal{X}$ of size $m$ is: $$\mathcal{H}\rvert_{s} = \{(h(x_1), h(x_2), ..., h(x_m)): h \in \mathcal{H}\}$$ We can think of this as the set of all the ways to classify our sample $S$ using hypotheses from $\mathcal{H}$. Note: this is a finite set with at most $2^m$ elements, since that is the total number of distinct ways to label $m$ points with $\{-1, 1\}$. In general, with a finite label space $\mathcal{Y}$, $|\mathcal{H}\rvert_s| \leq |\mathcal{Y}|^m$. If $\mathcal{Y}$ is infinite, the problem becomes regression, which is much more difficult to analyze and to which the methods we have studied in this course no longer apply.

Definition. The growth function $\Pi_\mathcal{H}: \mathbb{N} \rightarrow \mathbb{N}$ gives the maximum number of ways to classify $m$ points using functions in $\mathcal{H}$: $$\Pi_\mathcal{H}(m) = \max_{S \subseteq \mathcal{X}, |S|=m} |\mathcal{H}\rvert_s|.$$ We can think of this as the size of the biggest possible restriction over all samples of size $m$ from our input space.
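As an illustrative sketch (not from lecture; the 1-D threshold class and all names in the code are chosen purely for illustration), the snippet below enumerates the restriction $\mathcal{H}\rvert_s$ and brute-forces the growth function over a finite pool of candidate points. For threshold classifiers $h_t(x) = +1$ iff $x \geq t$, only the $m + 1$ "suffix" labelings of the sorted sample are realizable, so it prints $\Pi_\mathcal{H}(m) = m + 1$.

```python
from itertools import combinations

def restriction(sample, thresholds):
    """H|_S for threshold classifiers h_t(x) = +1 if x >= t else -1,
    enumerated over a finite grid of candidate thresholds."""
    return {tuple(+1 if x >= t else -1 for x in sample) for t in thresholds}

def growth_function(points, m):
    """Brute-force Pi_H(m): the largest |H|_S| over all samples S of size m
    drawn from a finite pool of candidate points (a stand-in for the max over X)."""
    best = 0
    for S in combinations(points, m):
        # Thresholds at each sample point and just above each one realize every pattern.
        thresholds = sorted(set(S) | {x + 0.5 for x in S})
        best = max(best, len(restriction(S, thresholds)))
    return best

pool = [float(i) for i in range(10)]    # candidate points on the real line
for m in range(1, 6):
    print(m, growth_function(pool, m))  # prints m and m + 1: the growth function of thresholds
```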

Proposition. $\mathfrak{R}(\mathcal{H}) \leq \sqrt{\frac{2 \log \Pi_\mathcal{H}(m)}{m}}$. Note: this provides a relationship between Rademacher complexity and VC-Dimension (which is defined in terms of $\Pi_\mathcal{H}(m)$, as we shall see in a moment).

Proof. We will prove this statement using Massart's Finite Class Lemma. Since every $h \in \mathcal{H}$ takes values in $\{-1, 1\}$, we have:

$$\max_{h \in \mathcal{H}}\|(h(x_1), h(x_2), ..., h(x_m))\|_2 = \max_{u \in \mathcal{H}\rvert_s} \|u\|_2 = \sqrt{m}$$

This is because every vector of length $m$ with entries in $\{-1, 1\}$ has $\ell_2$-norm equal to $\sqrt{m}$. Using the definition of Rademacher Complexity:

$$\begin{align*} \mathfrak{R}(\mathcal{H}) &= \mathbb{E}_s \bigg[\mathbb{E}_\sigma \bigg[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\bigg]\bigg]\\\ &= \mathbb{E}_s \bigg[\mathbb{E}_\sigma \bigg[\sup_{u \in \mathcal{H}\rvert_s} \frac{1}{m} \sum_{i=1}^m \sigma_i u_i\bigg]\bigg] \end{align*}$$

Now, by Massart's Finite Class Lemma:

$$\begin{align*} &\leq \mathbb{E}_s \bigg[\max_{u \in \mathcal{H}\rvert_s} \|u\|_2 \frac{\sqrt{2\log |\mathcal{H}\rvert_s|}}{m}\bigg]\\\ &= \mathbb{E}_s \bigg[\sqrt{\frac{2\log |\mathcal{H}\rvert_s|}{m}}\bigg] \end{align*}$$

By definition of growth function, since $|\mathcal{H}\rvert_s| \leq \Pi_\mathcal{H}(m)$:

$$\leq \mathbb{E}_s \bigg[\sqrt{\frac{2\log \Pi_\mathcal{H}(m)}{m}}\bigg]$$

Finally, since nothing on the inside of the expectation depends on $S$, we can conclude:

$$\mathfrak{R}(\mathcal{H}) \leq \sqrt{\frac{2\log \Pi_\mathcal{H}(m)}{m}}.$$

Formal Definition of VC-Dimension

Definition. A set $S$ of $m \geq 1$ points is shattered by $\mathcal{H}$ if $|\mathcal{H}\rvert_s| = 2^m$. This means that the hypothesis class $\mathcal{H}$ is able to classify $S$ in all possible ways.

Definition. The VC-Dimension $VC(\mathcal{H})$ is the size of the largest set that $\mathcal{H}$ can shatter: $$VC(\mathcal{H}) = \max \{ m: \Pi_\mathcal{H}(m) = 2^m \}.$$ Note about proofs: to conclude that $VC(\mathcal{H}) = d$, one must prove both that $\exists$ a set of size $d$ which is shattered by $\mathcal{H}$ AND $\not\exists$ a set of size $d+1$ which is shattered by $\mathcal{H}$.
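As an illustrative sketch (not from lecture; the interval class and code names are chosen purely for illustration): intervals on the real line, $h_{[a, b]}(x) = +1$ iff $a \leq x \leq b$, have $VC = 2$, since two points can be shattered but the labeling $(+1, -1, +1)$ of three ordered points is never realizable. The brute-force check below verifies shattering over a finite pool of candidate points; this can certify that some set of size $d$ is shattered, while ruling out sets of size $d + 1$ in general still requires an argument over all of $\mathcal{X}$ (here the output happens to match the known answer).

```python
from itertools import combinations, product

def interval_labels(S, a, b):
    """Labels of the sample S under the interval hypothesis h_{[a, b]}."""
    return tuple(+1 if a <= x <= b else -1 for x in S)

def is_shattered(S, endpoints):
    """Check whether intervals with endpoints from `endpoints` realize all 2^|S| labelings of S."""
    realized = {interval_labels(S, a, b) for a, b in product(endpoints, repeat=2) if a <= b}
    realized.add(tuple(-1 for _ in S))  # an interval disjoint from S labels everything -1
    return len(realized) == 2 ** len(S)

pool = [0.0, 1.0, 2.0, 3.0]                                         # candidate points on the line
endpoints = sorted({x + d for x in pool for d in (-0.5, 0.0, 0.5)})  # candidate interval endpoints

for d in (1, 2, 3):
    shattered = any(is_shattered(S, endpoints) for S in combinations(pool, d))
    print(d, shattered)  # True, True, False  ->  consistent with VC(intervals) = 2
```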

Homework

In lecture we studied the growth function for classes of functions taking values in the set $\{-1, 1\}$, but the same definition applies to classes of functions taking values in the finite set $\mathcal{Y}$. In this case, $\Pi_{\mathcal{H}}(m) \leq |\mathcal{Y}|^m$ (analogous to $2^m$ in the original setup).

  1. Graded. Let $\mathcal{H}_1 \subseteq \{h : \mathcal{X} \to \mathcal{Y}_1\}$ and $\mathcal{H}_2 \subseteq \{h : \mathcal{X} \to \mathcal{Y}_2\}$ be function classes and let $\mathcal{H}_3 \subseteq \{h : \mathcal{X} \times \mathcal{X} \to \mathcal{Y}_1 \times \mathcal{Y}_2\}$ such that $\mathcal{H}_3 = \{(h_1,h_2) : h_1 \in \mathcal{H}_1, h_2 \in \mathcal{H}_2\}$. Show that $$\Pi_{\mathcal{H}_3}(m) = \Pi_{\mathcal{H}_1}(m) \cdot \Pi_{\mathcal{H}_2}(m).$$

  2. Graded. Let $\mathcal{H}_1 \subseteq \{h : \mathcal{X} \to \mathcal{Y}_1\}$ and $\mathcal{H}_2 \subseteq \{h : \mathcal{Y}_1 \to \mathcal{Y}_2\}$ be function classes and let $\mathcal{H}_3 \subseteq \{h : \mathcal{X} \to \mathcal{Y}_2\}$ such that $\mathcal{H}_3 = \{h_2 \circ h_1 : h_1 \in \mathcal{H}_1, h_2 \in \mathcal{H}_2\}$. Show that $$\Pi_{\mathcal{H}_3}(m) \leq \Pi_{\mathcal{H}_1}(m) \cdot \Pi_{\mathcal{H}_2}(m).$$

  3. Ungraded, optional. Prove that (2) is tight, i.e., exhibit $\mathcal{X}, \mathcal{Y}_1, \mathcal{Y}_2, \mathcal{H}_1, \mathcal{H}_2, m$ such that $$\Pi_{\mathcal{H}_3}(m) = \Pi_{\mathcal{H}_1}(m) \cdot \Pi_{\mathcal{H}_2}(m).$$ Hint. You can take $|\mathcal{X}|=m=1$.
