CS7545_Sp23_Lecture_10: Rademacher Complexity Continued

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler Labonte

Notes for Lecture 10

February 9, 2023

Scribes: Matthew Lau, Junghwan Lee

Introduction

Why do we care about Rademacher complexity? For one, Rademacher complexity will allow us to study the uniform deviation (and bound it!). Recall that, for a fixed sample $S$, the uniform deviation we want to control is $$\sup_{h\in \mathcal H} |L(h) - \hat L_S(h)|$$ (note: for ease of computation, we will ignore the absolute value $|\cdot |$ for the rest of this lecture).

Earlier in the class, we saw Rademacher complexity bounds for finite hypothesis classes, but how do we obtain a generalization error bound for an infinite hypothesis class (i.e., $|\mathcal H| =\infty$)? In this lecture, we consider a generalised hypothesis class that maps from our feature space to the reals, $\mathcal F \subseteq \{\mathcal X \mapsto \mathbb R\}$. For instance, this generalised hypothesis class $\mathcal F$ can be a "loss" class that maps features to loss values.

Recap

Last time, we analysed the hypothesis class $\mathcal H \subseteq \{\mathcal X \mapsto \{-1, 1\}\}$, where the generalisation error (risk) is defined as $L(h) := \mathbb P_{x, y \sim D} \left[h(x) \neq y \right]$ and the empirical error (risk) is defined as $$\hat L_S(h) := \frac{1}{m} \sum_{i=1}^{m} \mathbb{1} \left[ h(x_i) \neq y_i \right]$$ for a hypothesis $h\in \mathcal H$. To generalise this notion, we analyse the new hypothesis class $\mathcal F \subseteq \{\mathcal X \mapsto \mathbb R\}$, where the generalisation error (risk) is defined as $L(f) := \mathbb E_{x \sim D}\left[f(x)\right]$ and the empirical error (risk) is defined as $$\widehat L_S(f) := \frac{1}{m} \sum_{i=1}^m f(x_i)$$ for a hypothesis $f\in \mathcal F$.
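To make these definitions concrete, here is a minimal numerical sketch (my own illustration, not part of the lecture) comparing the true risk $L(f)$ with the empirical risk $\widehat L_S(f)$; the distribution $D$ and the function $f$ below are toy assumptions.

```python
# Minimal sketch (assumptions: D = Uniform[0, 1], f(x) = 1[x > 0.5]).
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy function in F mapping X = [0, 1] into {0, 1}.
    return (x > 0.5).astype(float)

m = 100
S = rng.uniform(0.0, 1.0, size=m)     # sample S ~ D^m
empirical_risk = f(S).mean()          # \hat{L}_S(f) = (1/m) sum_i f(x_i)
true_risk = 0.5                       # L(f) = E_{x ~ D}[f(x)] = P[x > 0.5]

print(f"L(f) = {true_risk:.3f}, empirical L_S(f) = {empirical_risk:.3f}")
```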

How do we relate the uniform deviation to Rademacher complexity? Through a technique called "symmetrization".

Symmetrization

Theorem 10.1 Symmetrization

Suppose $\mathcal{F} \subseteq \{\mathcal X \rightarrow \mathbb{R}\}$. Then,

$$\mathbb E_{S \sim D^m} \left[\sup_{f \in \mathcal{F}} L(f) - \hat{L}_S(f)\right] \leq 2R(\mathcal{F})$$

where

$$R(\mathcal{F}) = \mathbb{E}_{S \sim D^m} \left[\mathbb{E}_{\sigma} \left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_{i} f(x_i)\right]\right]$$

is the Rademacher complexity of $\mathcal F$ (we also write $\mathfrak R(\mathcal F)$ for this quantity, and $\hat{\mathfrak R}_S(\mathcal F)$ for the inner expectation with $S$ held fixed, i.e., the empirical Rademacher complexity).

If $\mathcal{F} \subseteq \{\mathcal X \rightarrow [0,1]\}$, a matching lower bound also holds, up to an additive term of order $\frac{1}{\sqrt{m}}$.

Proof (Reference)

We use two key ideas:

  1. Ghost Sample
$$S' = \{ x'_ i \}_{i=1}^m$$

$S'$ is called the ghost sample: it is drawn i.i.d. from $D$, independently of $S$ and with the same distribution as $S$.

We want to study $L(f) - \hat{L}_ S(f)$, which is difficult to handle directly because we do not know the true distribution of the data. Instead, we replace $L(f)$ with the equivalent quantity $\mathbb E_{S' \sim D^m} \hat{L}_ {S'}(f)$; indeed, by linearity of expectation, $\mathbb E_{S' \sim D^m} \left[\hat{L}_ {S'}(f)\right] = \frac{1}{m} \sum_{i=1}^m \mathbb E_{x'_ i \sim D}\left[f(x'_ i)\right] = L(f)$.

  2. Symmetrization
$$\begin{align*} \mathbb E_{S, S'} \left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} (f({x'}_i) - f(x_i))\right] &= \mathbb E_{S, S', \sigma} \left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i (f({x'}_i) - f(x_i))\right] \end{align*}$$

where $\sigma_1, \ldots, \sigma_m$ are i.i.d. Rademacher random variables, i.e., $\Pr[\sigma_i = 1] = \Pr[\sigma_i = -1] = \frac{1}{2}$.

If $\sigma_i=1$, the $i$-th summand does not change. If $\sigma_i=-1$, the summand $f({x'}_i)-f(x_i)$ becomes $f(x_i) - f({x'}_i)$, i.e., the roles of $x_i$ and ${x'}_i$ are swapped. Since we take the expectation over $S, S' \sim D^m$ and $x_i, {x'}_i$ are i.i.d., this swap does not change the value of the expectation (we are summing over all possibilities).

More explicitly: for any draw

$$S = \{x_1, \ldots, x_m\}, \qquad S' = \{{x'}_1, \ldots, {x'}_m\},$$

there exists an equally likely draw with the two samples exchanged,

$$S = \{{x'}_1, \ldots, {x'}_m\}, \qquad S' = \{x_1, \ldots, x_m\}.$$
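As a quick sanity check of the symmetrization step, the following simulation (my own illustration; the small function class and distribution are toy assumptions) verifies numerically that inserting the Rademacher signs leaves the expectation unchanged:

```python
# Check: E_{S,S'}[sup_f avg(f(x'_i) - f(x_i))] equals the same quantity with
# random signs sigma_i inserted, since flipping sigma_i just swaps x_i and x'_i.
import numpy as np

rng = np.random.default_rng(2)

F = [lambda x, t=t: (x > t).astype(float) for t in (0.3, 0.6)]   # toy class
m, trials = 20, 20000

no_sign, with_sign = [], []
for _ in range(trials):
    S = rng.uniform(0, 1, size=m)             # S  ~ D^m
    Sp = rng.uniform(0, 1, size=m)            # S' ~ D^m (ghost sample)
    sigma = rng.choice([-1.0, 1.0], size=m)
    diffs = np.stack([f(Sp) - f(S) for f in F])          # shape (|F|, m)
    no_sign.append(diffs.mean(axis=1).max())
    with_sign.append((sigma * diffs).mean(axis=1).max())

print("without sigma:", np.mean(no_sign))    # the two estimates should agree
print("with sigma:   ", np.mean(with_sign))  # up to Monte Carlo error
```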

Beginning of Proof

We have

$$\begin{align*} & \mathbb E_{S \sim D^m} \left[\sup_{f \in \mathcal{F}} \left[L(f) - \hat{L}_S (f)\right]\right] \\ &= \mathbb E_{S \sim D^m} \left[\sup_{f \in \mathcal{F}} \left[\mathbb E_{S' \sim D^m}\left[\hat{L}_{S'}(f) - \hat{L}_{S}(f)\right]\right]\right] && \text{substituting } L(f) = \mathbb E_{S' \sim D^m} \hat{L}_{S'} (f) \\ & \leq \mathbb E_{S,S' \sim D^m} \left[\sup_{f \in \mathcal{F}} \left[ \hat{L}_{S'}(f) - \hat{L}_{S}(f)\right]\right] && \text{since } \sup_{f} \mathbb E\left[\,\cdot\,\right] \leq \mathbb E\left[\sup_{f} \,\cdot\,\right] \\ &= \mathbb E_{S,S' \sim D^m} \left[\sup_{f \in \mathcal{F}} \left[\frac{1}{m} \sum_{i=1}^{m} (f({x'}_i) - f(x_i))\right]\right] \\ &= \mathbb E_{S,S' \sim D^m, \sigma} \left[\sup_{f \in \mathcal{F}} \left[\frac{1}{m} \sum_{i=1}^{m} \sigma_i (f({x'}_i) - f(x_i))\right]\right] && \text{symmetrization} \\ &\leq \mathbb E_{S,S',\sigma} \left[\sup_{f \in \mathcal{F}} \left[\frac{1}{m} \sum_{i=1}^{m} \sigma_i f({x'}_i)\right] + \sup_{f \in \mathcal{F}} \left[\frac{1}{m} \sum_{i=1}^{m} (- \sigma_i) f(x_i) \right]\right] && \text{subadditivity of } \sup \\ &= \mathbb E_{S',\sigma} \left[\sup_{f \in \mathcal{F}} \left[\frac{1}{m} \sum_{i=1}^{m} \sigma_i f({x'}_i)\right]\right] + \mathbb E_{S,\sigma} \left[\sup_{f \in \mathcal{F}} \left[\frac{1}{m} \sum_{i=1}^{m} (-\sigma_i) f(x_i)\right]\right] \\ &= 2R(\mathcal{F}) \end{align*}$$

The last equality holds because $\sigma_i$ and $-\sigma_i$ have the same distribution, and $S$ and $S'$ have the same distribution, so both expectations equal $R(\mathcal{F})$.
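When $\mathcal F$ is small, $R(\mathcal F)$ can be approximated by Monte Carlo. Below is a rough sketch (my own illustration; the threshold class and distribution are toy assumptions) that estimates both the inner expectation over $\sigma$ and the outer expectation over $S$ by sampling:

```python
# Rough Monte Carlo estimate of R(F) for a small finite class of threshold
# functions on X = [0, 1], with D = Uniform[0, 1] (both are assumptions).
import numpy as np

rng = np.random.default_rng(1)

# Toy class F = {x -> 1[x > t]} for a few thresholds t.
F = [lambda x, t=t: (x > t).astype(float) for t in (0.25, 0.5, 0.75)]

def empirical_rademacher(S, n_sigma=2000):
    """Estimate E_sigma[ sup_{f in F} (1/m) sum_i sigma_i f(x_i) ] for fixed S."""
    m = len(S)
    vals = np.stack([f(S) for f in F])                # shape (|F|, m)
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
    # For each draw of sigma, take the sup over F of the signed empirical average.
    return (sigma @ vals.T / m).max(axis=1).mean()

m, n_samples = 50, 200
estimates = [empirical_rademacher(rng.uniform(0, 1, size=m)) for _ in range(n_samples)]
print("Monte Carlo estimate of R(F):", np.mean(estimates))
```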

Mapping to $[0, 1]$

In fact, let us consider the more specific case where the function class maps into the interval $[0,1]$ (which occurs naturally for bounded loss functions); in this case we can obtain stronger, high-probability results.

Theorem 10.2. Suppose $\mathcal F \subseteq \{ \mathcal X \mapsto [0, 1] \}$. Then, for any $0 < \delta < 1$, with probability $\geq 1-\delta$ over the sample $S\sim D^m$, for all $f\in \mathcal F$ we have that:

  1. $L(f) \leq \hat L_S(f) + 2 \mathfrak R (\mathcal F) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$
  2. $L(f) \leq \hat L_S(f) + 2 \hat{\mathfrak R}_S (\mathcal F) + 3\sqrt{\frac{\log \frac{1}{\delta}}{2m}}$

This theorem gives us several results. For one, we obtain a bound on the generalisation error in terms of the empirical risk, the Rademacher complexity, and a term determined by the confidence level $\delta$ and the sample size $m$. Although the first result is theoretically interesting, it is often difficult to evaluate in practice, because the (distribution-dependent) Rademacher complexity is hard to compute. Hence, the second result is of practical interest, because the empirical Rademacher complexity can be computed from the sample itself. Moreover, replacing the Rademacher complexity by its empirical counterpart only changes the bound by a constant factor on the $\sqrt{\frac{\log \frac{1}{\delta}}{2m}}$ term!
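To illustrate why the second bound is the "practical" one, here is a hedged sketch (my own, with a toy finite class and distribution as assumptions) showing that every term on its right-hand side can be computed from the sample alone:

```python
# Compute the RHS of Theorem 10.2(2): L_hat_S(f) + 2 * R_hat_S(F) + 3 sqrt(log(1/delta)/(2m)).
# The class F and distribution D below are toy assumptions.
import numpy as np

rng = np.random.default_rng(3)

F = [lambda x, t=t: (x > t).astype(float) for t in (0.25, 0.5, 0.75)]
m, delta = 200, 0.05
S = rng.uniform(0, 1, size=m)

vals = np.stack([f(S) for f in F])                       # shape (|F|, m)
emp_risks = vals.mean(axis=1)                            # \hat{L}_S(f) for each f

sigma = rng.choice([-1.0, 1.0], size=(5000, m))
emp_rad = (sigma @ vals.T / m).max(axis=1).mean()        # Monte Carlo \hat{R}_S(F)

bound = emp_risks + 2 * emp_rad + 3 * np.sqrt(np.log(1 / delta) / (2 * m))
for f_idx, b in enumerate(bound):
    print(f"f_{f_idx}: L_hat = {emp_risks[f_idx]:.3f}, bound on L(f) <= {b:.3f}")
```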

The proof of this theorem begins at the end of these notes and will continue in the next lecture. For now, we are more interested in seeing how this theorem implies a cute corollary about our previous hypothesis class for classification, $\mathcal H \subseteq \{\mathcal X \mapsto \{-1, 1\}\}$.

Corollary. Suppose $\mathcal H \subseteq \{ \mathcal X \mapsto \{-1, 1\} \}$. Then, for any $0 < \delta < 1$, with probability $\geq 1-\delta$ over the sample $S\sim D^m$, for all $h\in \mathcal H$ we have that:

  1. $L(h) \leq \hat L_S(h) + \mathfrak R (\mathcal H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$
  2. $L(h) \leq \hat L_S(h) + \hat{\mathfrak R}_S (\mathcal H) + 3\sqrt{\frac{\log \frac{1}{\delta}}{2m}}$

From the corollary, we can see that we have obtained the result we wanted at the start of lecture: a bound on the generalisation error in terms of Rademacher complexity for a general hypothesis class for classification, $\mathcal H \subseteq \{\mathcal X \mapsto \{-1, 1\}\}$. We will not prove this corollary, but the proof idea is the following. Let $\mathcal F = \{ (x, y) \mapsto \mathbb{1}[h(x)\neq y]: h\in \mathcal H\}$. Writing $\mathbb{1}[h(x)\neq y] = \frac{1 - y h(x)}{2}$, which transforms the range from $\{-1, 1\}$ to $\{0, 1\}$, we obtain that $\hat {\mathfrak{R}}_S(\mathcal F) =\frac{1}{2}\hat {\mathfrak{R}}_S(\mathcal H)$, and this factor $\frac{1}{2}$ cancels the factor $2$ in Theorem 10.2.
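Here is a small numerical check of that identity (my own illustration, not part of the notes; the threshold hypothesis class and random labels are assumptions):

```python
# Check that the empirical Rademacher complexity of the 0-1 loss class F is
# half that of the hypothesis class H, on a toy labeled sample.
import numpy as np

rng = np.random.default_rng(4)

# Toy hypothesis class: sign-threshold classifiers on X = [-1, 1].
H = [lambda x, t=t: np.where(x > t, 1.0, -1.0) for t in (-0.5, 0.0, 0.5)]
m = 30
x = rng.uniform(-1, 1, size=m)
y = rng.choice([-1.0, 1.0], size=m)

h_vals = np.stack([h(x) for h in H])         # shape (|H|, m), values in {-1, 1}
loss_vals = (h_vals != y).astype(float)      # 1[h(x_i) != y_i], values in {0, 1}

sigma = rng.choice([-1.0, 1.0], size=(50000, m))
rad_H = (sigma @ h_vals.T / m).max(axis=1).mean()
rad_F = (sigma @ loss_vals.T / m).max(axis=1).mean()

print(f"R_hat_S(H) = {rad_H:.4f}")
print(f"R_hat_S(F) = {rad_F:.4f}  (should be close to half of R_hat_S(H))")
```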

Before we start the proof of Theorem 10.2, we first need McDiarmid's inequality.

Theorem 10.3. McDiarmid's inequality (aka bounded differences). Suppose $g: \mathcal X^m \rightarrow \mathbb{R}$ and there exist constants $C_i$, $1 \leq i \leq m$, such that for all $s, s'$ differing only in the $i$-th component,

$$|g(s) - g(s')| \leq C_i $$

What does this condition mean? Suppose

$$s = (x_1, x_2, ..., x_i, x_{i+1}, ..., x_m) $$

$$s' = (x_1, x_2, ..., {x_i}', x_{i+1}, ..., x_m) $$

That is, changing a single coordinate of the input changes the value of $g$ by at most $C_i$.

Then, $\forall \epsilon \geq 0$, $$\Pr_{S \sim D^m} \left[|g(S) - \mathbb E_{S \sim D^m} \left[g(S)\right]| \geq \epsilon \right] \leq 2\exp\left(\frac{-2\epsilon^2}{\sum\limits_ {i=1}^{m}{C_i}^2}\right)$$
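As a quick worked example (my own, not from the lecture), take $g(s) = \frac{1}{m} \sum_{i=1}^m x_i$ with each $x_i \in [0, 1]$. Changing one coordinate changes $g$ by at most $C_i = \frac{1}{m}$, so $\sum_{i=1}^m C_i^2 = \frac{1}{m}$ and McDiarmid's inequality gives

$$\Pr_{S \sim D^m}\left[\left|\frac{1}{m}\sum_{i=1}^m x_i - \mathbb E\left[\frac{1}{m}\sum_{i=1}^m x_i\right]\right| \geq \epsilon\right] \leq 2\exp\left(-2m\epsilon^2\right),$$

which recovers Hoeffding's inequality for bounded random variables.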

Proof of Theorem 10.2

Let $g(S) := \sup_{f\in \mathcal F} \left[L(f) - \hat L_S(f)\right]$, and let $S, S'$ be samples that differ only in the $i$-th point ($x_i$ in $S$ versus $x'_i$ in $S'$). Then,

$$ \begin{align*} g(S') - g(S) &= \sup_{f\in \mathcal F} \left[L(f) - \hat L_{S'}(f)\right] - \sup_{f\in \mathcal F} \left[L(f) - \hat L_{S}(f)\right] \\ &\leq \sup_{f\in \mathcal F} \left[L(f) - \hat L_{S'}(f) - (L(f) - \hat L_{S}(f))\right] && \text{by subadditivity of } \sup \\ &= \sup_{f\in \mathcal F} \left[L(f) - \hat L_{S'}(f) - L(f) + \hat L_{S}(f)\right] \\ &= \sup_{f\in \mathcal F} \left[\hat L_{S}(f) - \hat L_{S'}(f) \right] \\ &= \sup_{f\in \mathcal F} \left[ \frac{f(x_i) - f(x'_i)}{m} \right] && \text{since $S$ and $S'$ differ only in the $i^{th}$ component}\\ &\leq \frac{1}{m} && \text{since } f(x) \in [0,1] \end{align*} $$
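Together with the symmetric argument swapping the roles of $S$ and $S'$, this shows $|g(S) - g(S')| \leq \frac{1}{m}$, i.e., $g$ satisfies the bounded differences condition with $C_i = \frac{1}{m}$ for every $i$; plugging $\sum_{i=1}^m C_i^2 = \frac{1}{m}$ into McDiarmid's inequality is presumably the next step.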

The proof will be continued in the next lecture.
