CS7545_Sp23_Lecture_08: Introduction to Rademacher Complexity

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler Labonte

Notes for Lecture 08

February 02, 2023

Scribes: Rahul Aggarwal, Ignat Georgiev

Generalization Error Bounds for Finite Sets

We first revisit a key result regarding hypothesis classes:

Theorem 8.1 Let $\mathcal{H}$ be a finite hypothesis class. Then for any $\delta > 0$, with probability at least $1 - \delta$ over a sample $S \sim D^m$, the following holds for all $h \in \mathcal{H}$:

$$L(h) \le \hat{L}_S(h) + \sqrt{\frac{\log{|\mathcal{H}|} + \log{\frac{1}{\delta}}}{2m}}$$

This tells us that the larger the hypothesis class, the slower the uniform convergence: more samples are needed before the empirical loss is guaranteed to be close to the true loss for every $h \in \mathcal{H}$.

Proof Refer to Lecture 07
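As a quick numerical sanity check on the bound in Theorem 8.1, the sketch below (plain Python; the values of $|\mathcal{H}|$, $\delta$, and the sample sizes are hypothetical, chosen only for illustration) evaluates the complexity term and shows its $O(1/\sqrt{m})$ decay and its mild, logarithmic dependence on $|\mathcal{H}|$.

```python
import math

def finite_class_bound(H_size: int, delta: float, m: int) -> float:
    """Complexity term sqrt((log|H| + log(1/delta)) / (2m)) from Theorem 8.1.

    Logs are natural logs; H_size, delta, and m are illustrative inputs.
    """
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

# Hypothetical setting: 2^20 hypotheses (e.g., a 20-bit parameterization), delta = 0.05.
for m in [100, 1_000, 10_000]:
    print(m, finite_class_bound(H_size=2**20, delta=0.05, m=m))
# The gap shrinks at rate O(1/sqrt(m)) and grows only logarithmically in |H|.
```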

Note that finite hypothesis classes are rare in the real world; they typically arise only when we restrict the number of bits used to represent the parameters of our models. Therefore, we would like to apply Uniform Convergence to infinite hypothesis classes as well. To do this, we need to define another measure of complexity.

Complexity of Infinite Hypothesis Classes

Before we formally define a notion of complexity, we note that many such metrics could potentially be used (some of which we may revisit later):

Ideas:

  1. # of parameters
  2. How many random labels can it fit?
  3. Difficulty of approximating functions
  4. Absolute difference in function values.

All these ideas are important, but we focus on (2) and (4).

Rademacher Complexity

The Rademacher complexity of a class of functions measures how expressive the class is by quantifying how well it can fit random noise. To formally define this notion, say we have an infinite hypothesis class $\mathcal{H}$, a training sample $S$, and a class of functions $\mathcal{F}$ such that

$$\begin{align*} \mathcal{H} &\subseteq \lbrace h : \mathcal{X} \to \lbrace -1, 1 \rbrace \rbrace \\ S &= \lbrace (x_i, y_i) \rbrace_{i = 1}^m \sim D^m \\ \mathcal{F} &\subseteq \lbrace f : \mathcal{X} \to \mathbb{R} \rbrace \end{align*}$$

Additionally, we define two key concepts:

Definition 8.1: A Rademacher Random Variable takes on values $\pm 1$ and is defined as such:

$$\sigma_i = \begin{cases} \;\;\, 1 & \text{w.p. } \frac{1}{2} \\ -1 & \text{w.p. } \frac{1}{2} \\ \end{cases}$$

Definition 8.2: A Rademacher Random Vector is a vector $\sigma$ such that

$$\sigma = (\sigma_1, \ldots, \sigma_m)$$

Note: We also refer to these as random labels.

Definition 8.3: The Empirical Rademacher Complexity of the function class $\mathcal{F}$ with respect to a sample $S$ is

$$\hat{R}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i)\right]$$

Definition 8.4: Given a distribution $D$ and sample size $m$, the Rademacher complexity of $\mathcal{F}$ is $$R_m(\mathcal{F}) = \underset{S \sim D^m}{\mathbb{E}}\left[\hat{R}_S(\mathcal{F})\right]$$
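When $\mathcal{F}$ is finite (or when we only optimize over a finite subset of $\mathcal{F}$), Definition 8.3 can be computed directly. The sketch below is a hypothetical Python helper, `empirical_rademacher`, that evaluates the expectation over $\sigma$ exactly by enumerating all $2^m$ sign vectors when $m$ is small, and by Monte Carlo sampling otherwise; it is only an illustration of the definition, not part of the lecture.

```python
import itertools
import random

def empirical_rademacher(functions, xs, num_samples=None):
    """Empirical Rademacher complexity (Definition 8.3) of a finite collection
    of functions evaluated on the sample points xs.

    If num_samples is None, the expectation over sigma is computed exactly by
    enumerating all 2^m sign vectors; otherwise it is approximated with
    num_samples random draws of sigma.
    """
    m = len(xs)
    values = [[f(x) for x in xs] for f in functions]  # pre-compute f(x_i)

    def sup_correlation(sigma):
        # sup over f in F of (1/m) * sum_i sigma_i * f(x_i)
        return max(sum(s * v for s, v in zip(sigma, vals)) / m for vals in values)

    if num_samples is None:
        sigmas = list(itertools.product([-1, 1], repeat=m))  # all 2^m sign vectors
        return sum(sup_correlation(sigma) for sigma in sigmas) / len(sigmas)

    sigmas = [[random.choice([-1, 1]) for _ in range(m)] for _ in range(num_samples)]
    return sum(sup_correlation(sigma) for sigma in sigmas) / num_samples
```

The Rademacher complexity $R_m(\mathcal{F})$ of Definition 8.4 would then be estimated by averaging this quantity over fresh samples $S \sim D^m$.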

Interpretations of Rademacher Complexity

There are two key interpretations of Rademacher Complexity:

  1. We can think of $\sigma$ (the Rademacher random vector) as a partition of the training sample $S$. Since each $\sigma_i \in \lbrace -1, 1 \rbrace$, the vector $\sigma$ partitions the sample points, and hence the function values $f(x_i)$, into two groups. With this in mind, we can rewrite (8.3) as
$$\hat{R}_S(\mathcal{F}) = \frac{1}{m} \underset{\sigma}{\mathbb{E}}\left[\sup_{f \in \mathcal{F}} \left(\sum_{\sigma_i=1} f(x_i) - \sum_{\sigma_j=-1}f(x_j)\right)\right]$$

We see now that the inner portion of the expectation asks: "Given a partition $\sigma$, find the maximum difference between groups of function values".

  2. The second interpretation starts with writing $\vec{f} = (f(x_1), \ldots, f(x_m))$. Then, we can rewrite (8.3) as
$$\hat{R}_S(\mathcal{F}) = \frac{1}{m} \underset{\sigma}{\mathbb{E}}\left[\sup_{f \in \mathcal{F}} \langle \sigma, \vec{f} \rangle\right]$$

where we are trying to maximize the correlation between a random vector $\sigma$ and $\vec{f}$. Regardless of interpretation, the key point here is that as $\hat{R}_S(\mathcal{F})$ increases, so too does the complexity of the function class $\mathcal{F}$.
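The two reformulations above are the same quantity written differently. A tiny numerical check (Python; the function $f$ and the points below are arbitrary illustrative choices) confirms that the inner-product form $\frac{1}{m}\langle \sigma, \vec{f}\rangle$ agrees with the partition form for any draw of $\sigma$:

```python
import random

# Arbitrary illustrative function and sample points.
f = lambda x: 1.0 if x >= 0 else -1.0
xs = [-2.0, -0.5, 0.3, 1.7]

sigma = [random.choice([-1, 1]) for _ in xs]   # one Rademacher random vector
f_vec = [f(x) for x in xs]                     # the vector (f(x_1), ..., f(x_m))

# Interpretation 2: correlation between sigma and f_vec.
inner_product_form = sum(s * v for s, v in zip(sigma, f_vec)) / len(xs)

# Interpretation 1: difference between the two groups induced by sigma.
partition_form = (sum(v for s, v in zip(sigma, f_vec) if s == 1)
                  - sum(v for s, v in zip(sigma, f_vec) if s == -1)) / len(xs)

assert abs(inner_product_form - partition_form) < 1e-12
```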

Examples

Example 8.1 Suppose we define a function $f$ such that

$$f(x) = \begin{cases} \;\;\, 1 & x \ge 1 \\ -1 & \text{else} \end{cases}$$

Then, let our function class $\mathcal{F} = \lbrace f \rbrace$, and let our training sample $S = \lbrace (0, 1), (1, -1) \rbrace$. The Empirical Rademacher Complexity is therefore

$$\begin{align*} \hat{R}_S(\mathcal{F}) &= \frac{1}{4} \sum_{\sigma} \left( \frac{1}{2} \sum_{i = 1}^2 \sigma_i f(x_i) \right) && \text{sup is gone as we consider only one function; each $\sigma$ has probability $\frac{1}{4}$} \\\ &= \frac{1}{8} ((f(0) + f(1)) + (f(0) - f(1)) + (-f(0) + f(1)) + (-f(0) - f(1))) && \text{all $2^2 = 4$ values of $\sigma$} \\\ &= \frac{1}{8} (0 + (-2) + 2 + 0) = 0 && \text{since $f(0) = -1$ and $f(1) = 1$} \end{align*}$$

Example 8.2 Now suppose we define an additional function $g$ such that

$$g(x) = \begin{cases} \;\;\, 1 & x \ge 0 \\ -1 & \text{else} \end{cases}$$

Then, let our function class $\mathcal{G} = \lbrace f, g \rbrace$ while keeping our training sample $S$ the same. This means that we get to choose the function in the supremum for each value of $\sigma$. Then, the Empirical Rademacher Complexity is

$$\begin{align*} \hat{R}_S(\mathcal{G}) &= \frac{1}{8} ((g(0) + g(1)) + (g(0) - g(1)) + (-f(0) + f(1)) + (-f(0) - f(1))) && \text{the supremum picks $g$ for the first two values of $\sigma$ and $f$ for the last two} \\ &= \frac{1}{8} (2 + 0 + 2 + 0) = \frac{1}{2} \end{align*}$$

This shows that the function class $\mathcal{G}$ is more complex than the function class $\mathcal{F}$, which intuitively seems true.
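Both hand calculations can be reproduced with the hypothetical `empirical_rademacher` helper sketched after Definition 8.4 (only the inputs $x_i = 0, 1$ of the sample matter; the labels $y_i$ play no role in the definition):

```python
f = lambda x: 1 if x >= 1 else -1   # the function from Example 8.1
g = lambda x: 1 if x >= 0 else -1   # the additional function from Example 8.2
xs = [0, 1]                         # the inputs of the sample S

print(empirical_rademacher([f], xs))      # 0.0, matching Example 8.1
print(empirical_rademacher([f, g], xs))   # 0.5, matching Example 8.2
```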

Rademacher Complexity for Infinite Classes

Let an infinite function class be defined as follows:

$$ \mathcal{F} = \{ x \mapsto \langle w, x \rangle : \|w\|_2 \le B \} $$

where, for all $1 \le i \le m$, $x_i \in \mathbb{R}^n$ and $\|x_i\|_2 \le C$.

Proposition 8.1 $\hat{R}_S(\mathcal{F}) \le \frac{BC}{\sqrt{m}}$

Note: This bound is data-dependent but dimension-independent!
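Before proving the proposition, here is a Monte Carlo sanity check (a sketch with synthetic data; the dimensions, radii, and number of $\sigma$ draws are arbitrary). For this class the inner supremum has a closed form: by Cauchy-Schwarz, $\sup_{\|w\|_2 \le B} \langle w, \sum_i \sigma_i x_i\rangle = B \|\sum_i \sigma_i x_i\|_2$, so each draw of $\sigma$ contributes $\frac{B}{m}\|\sum_i \sigma_i x_i\|_2$.

```python
import math
import random

def rad_linear_class(xs, B, num_sigma=2_000):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    linear class {x -> <w, x> : ||w||_2 <= B} on the points xs."""
    m, n = len(xs), len(xs[0])
    total = 0.0
    for _ in range(num_sigma):
        sigma = [random.choice([-1, 1]) for _ in range(m)]
        u = [sum(s * x[k] for s, x in zip(sigma, xs)) for k in range(n)]  # sum_i sigma_i x_i
        total += B * math.sqrt(sum(uk * uk for uk in u)) / m
    return total / num_sigma

random.seed(0)
m, n, B, C = 200, 5, 1.0, 1.0
# Synthetic points on the sphere of radius C, so ||x_i||_2 <= C holds exactly.
xs = []
for _ in range(m):
    v = [random.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(vk * vk for vk in v))
    xs.append([C * vk / norm for vk in v])

print(rad_linear_class(xs, B))   # Monte Carlo estimate of R_hat_S(F)
print(B * C / math.sqrt(m))      # the bound BC / sqrt(m) from Proposition 8.1
```

The estimate should come out slightly below $BC/\sqrt{m}$, consistent with the proposition.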

Proof

$$\begin{align*} \hat{R}_S(\mathcal{F}) &= \underset{\sigma}{\mathbb{E}}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i = 1}^m \sigma_if(x_i) \right] \\\ &= \frac{1}{m} \underset{\sigma}{\mathbb{E}}\left[ \sup_{w : \|w\|_2 \le B} \sum_{i = 1}^m \sigma_i \langle w, x_i \rangle \right] \\\ &= \frac{1}{m} \underset{\sigma}{\mathbb{E}}\left[ \sup_{w : \|w\|_2 \le B} \left\langle w, \sum_{i = 1}^m \sigma_i x_i \right\rangle \right] \\\ &\le \frac{1}{m} \underset{\sigma}{\mathbb{E}}\left[ \sup_{w : \|w\|_2 \le B} \|w\|_2 \left\|\sum_{i = 1}^m \sigma_i x_i \right\|_2 \right] && \text{by the Cauchy-Schwarz Inequality} \end{align*}$$

Because we are taking the supremum over $w$, and we know that $\|w\|_2$ is upper bounded by $B$, we can upper bound our quantity as well.

$$\begin{align*} \hat{R}_S(\mathcal{F}) &\le \frac{B}{m} \underset{\sigma}{\mathbb{E}}\left[\left\lVert\sum_{i = 1}^m \sigma_i x_i \right\rVert_2 \right] && \text{by taking the supremum} \\\ &= \frac{B}{m} \underset{\sigma}{\mathbb{E}}\left[\left(\left\|\sum_{i = 1}^m \sigma_i x_i \right\|_2^2\right)^{\frac{1}{2}} \right] \\\ &\le \frac{B}{m} \left(\underset{\sigma}{\mathbb{E}}\left[\left\|\sum_{i = 1}^m \sigma_i x_i \right\|_2^2 \right]\right)^{\frac{1}{2}} && \text{Jensen's Inequality for concave functions} \\\ &= \frac{B}{m} \left(\underset{\sigma}{\mathbb{E}}\left[\sum_{i,j} \langle \sigma_i x_i , \sigma_j x_j \rangle \right]\right)^{\frac{1}{2}} \\\ &= \frac{B}{m} \left(\sum_{i,j} \langle x_i, x_j \rangle \underset{\sigma}{\mathbb{E}} \left[ \sigma_i \sigma_j \right]\right)^{\frac{1}{2}} && \text{By the linearity of expectation} \end{align*}$$

Remark 8.1 Jensen's Inequality states that for convex $\phi$, $\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)]$. In the same way, for concave $\psi$, $\mathbb{E}[\psi(X)] \le \psi(\mathbb{E}[X])$. This is what we used above for the concave square root.

We notice that because the $\sigma_i$ are independent, zero-mean Rademacher random variables, when $i \ne j$, $\underset{\sigma}{\mathbb{E}}[\sigma_i \sigma_j] = \underset{\sigma}{\mathbb{E}}[\sigma_i]\underset{\sigma}{\mathbb{E}}[\sigma_j] = 0$, and when $i = j$, $\underset{\sigma}{\mathbb{E}}[\sigma_i \sigma_j] = \underset{\sigma}{\mathbb{E}}[\sigma_i^2] = 1$. Therefore,

$$\begin{align*} \hat{R}_S(\mathcal{F}) &\le \frac{B}{m} \left(\sum_{i \ne j} \langle x_i, x_j \rangle \underbrace{\underset{\sigma}{\mathbb{E}} \left[ \sigma_i \sigma_j \right]}_{= 0} + \sum_{i = j} \langle x_i, x_j \rangle \underbrace{\underset{\sigma}{\mathbb{E}} \left[ \sigma_i \sigma_j \right]}_{= 1}\right)^{\frac{1}{2}} \\\ &= \frac{B}{m} \left(\sum_{i = j} \langle x_i, x_j \rangle\right)^{\frac{1}{2}} = \frac{B}{m} \left(\sum_{i=1}^m \|x_i\|_2^2 \right)^{\frac{1}{2}} \\\ &\le \frac{B\sqrt{mC^2}}{m} = \frac{BC}{\sqrt{m}} && \text{since $\|x_i\|_2 \le C$ for all $i$} \end{align*}$$
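A small numerical check of the key cancellation step (Python; the points are arbitrary synthetic data): enumerating all $2^m$ sign vectors confirms that $\mathbb{E}_\sigma\left[\|\sum_i \sigma_i x_i\|_2^2\right] = \sum_i \|x_i\|_2^2$, since the cross terms vanish.

```python
import itertools
import random

random.seed(1)
m, n = 4, 3
xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]  # arbitrary points

sq_norm = lambda v: sum(vk * vk for vk in v)

# Exact expectation over sigma by enumerating all 2^m equally likely sign vectors.
lhs = 0.0
for sigma in itertools.product([-1, 1], repeat=m):
    u = [sum(s * x[k] for s, x in zip(sigma, xs)) for k in range(n)]  # sum_i sigma_i x_i
    lhs += sq_norm(u)
lhs /= 2 ** m

rhs = sum(sq_norm(x) for x in xs)
print(lhs, rhs)   # the two values agree (up to floating point)
```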