CS7545_Sp23_Lecture_08: Introduction to Rademacher Complexity
CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Tyler Labonte
Notes for Lecture 08
February 02, 2023
Scribes: Rahul Aggarwal, Ignat Georgiev
Generalization Error Bounds for Finite Sets
We first revisit a key result regarding hypothesis classes:
Theorem 8.1 Let $\mathcal{H}$ be a finite hypothesis class. Then for any $\delta > 0$, with probability at least $1 - \delta$ over a sample $S \sim D^m$, the following holds for all $h \in \mathcal{H}$:

$$\mathrm{err}_D(h) \le \widehat{\mathrm{err}}_S(h) + \sqrt{\frac{\ln |\mathcal{H}| + \ln (1/\delta)}{2m}}$$
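As a quick numerical sketch of the deviation term (the helper name and the particular numbers below are purely illustrative), the bound shrinks like $1/\sqrt{m}$ but grows only logarithmically in $|\mathcal{H}|$:

```python
import math

def finite_class_bound(num_hypotheses, m, delta):
    """Deviation term sqrt((ln|H| + ln(1/delta)) / (2m)) from Theorem 8.1."""
    return math.sqrt((math.log(num_hypotheses) + math.log(1.0 / delta)) / (2 * m))

# Doubling |H| barely moves the bound, while quadrupling m roughly halves it.
print(finite_class_bound(num_hypotheses=1_000, m=10_000, delta=0.05))  # ~0.022
print(finite_class_bound(num_hypotheses=2_000, m=10_000, delta=0.05))  # ~0.023
print(finite_class_bound(num_hypotheses=1_000, m=40_000, delta=0.05))  # ~0.011
```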
Note that finite classes are rare in practice; they typically arise only when we restrict the number of bits used to represent the parameters of our models. We would therefore like to apply Uniform Convergence to infinite hypothesis classes as well, and to do so we need a different measure of complexity.
Complexity of Infinite Hypothesis Classes
Before we formally define a notion of complexity, we note that many such metrics could potentially be used (some of which we may revisit later):
Ideas:
1. Number of parameters
2. How many random labels can it fit?
3. Difficulty of approximating functions
4. Absolute difference in function values
All these ideas are important, but we focus on (2) and (4).
Rademacher Complexity
The Rademacher complexity of a class of functions measures how expressive the class is via how well it can fit random noise. To formally define this notion, say we have an infinite hypothesis class $\mathcal{H}$, a training sample $S = (x_1, \ldots, x_m)$, and a class of real-valued functions $\mathcal{F}$ derived from $\mathcal{H}$ (for instance, the losses of the hypotheses in $\mathcal{H}$). Let $\sigma = (\sigma_1, \ldots, \sigma_m)$ be a vector of independent Rademacher random variables, each uniform on $\lbrace -1, 1 \rbrace$.

Definition 8.3: Given a sample $S$, the Empirical Rademacher Complexity of $\mathcal{F}$ with respect to $S$ is

$$\hat{R}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i)\right]$$
Definition 8.4: Given a distribution $D^m$, the Rademacher complexity of $\mathcal{F}$ with respect to $D^m$ is
$$R_m(\mathcal{F}) = \underset{S \sim D^m}{\mathbb{E}}\left[\hat{R}_S(\mathcal{F})\right]$$
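To make the definitions concrete, here is a minimal Monte Carlo sketch (the function `empirical_rademacher` and the toy classes are illustrative, not from the lecture) that estimates $\hat{R}_S(\mathcal{F})$ for a finite class of functions represented by their values on the sample:

```python
import numpy as np

def empirical_rademacher(F_values, num_draws=10_000, seed=None):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(x_i) ].

    F_values has shape (|F|, m): row k holds (f_k(x_1), ..., f_k(x_m)).
    """
    rng = np.random.default_rng(seed)
    m = F_values.shape[1]
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)  # one Rademacher vector
        total += np.max(F_values @ sigma) / m    # sup over the finite class
    return total / num_draws

# A single fixed function cannot chase random signs, but a large collection can,
# so the estimate grows with the richness of the class.
rng = np.random.default_rng(0)
one_f   = rng.choice([-1.0, 1.0], size=(1, 20))
many_fs = rng.choice([-1.0, 1.0], size=(500, 20))
print(empirical_rademacher(one_f, seed=1))    # close to 0
print(empirical_rademacher(many_fs, seed=1))  # noticeably larger
```

Averaging such estimates over fresh samples $S \sim D^m$ would correspondingly approximate $R_m(\mathcal{F})$.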
Interpretations of Rademacher Complexity
There are two key interpretations of Rademacher Complexity:
We can think of $\sigma$ (the Rademacher random vector) as a partition of the training sample $S$. We know that $\sigma_i \in \lbrace -1, 1 \rbrace$, which means that $\sigma$ partitions the function values of $f$ into two groups. With this in mind, we can rewrite (8.3) as

$$\hat{R}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \left(\sum_{i : \sigma_i = 1} f(x_i) - \sum_{i : \sigma_i = -1} f(x_i)\right)\right]$$
We see now that the inner portion of the expectation asks: "Given a partition $\sigma$, find the maximum difference between groups of function values".
The second interpretation starts with writing $\vec{f} = (f(x_1), \ldots, f(x_m))$. Then, we can rewrite (8.3) as

$$\hat{R}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}}\left[\sup_{f \in \mathcal{F}} \frac{1}{m} \langle \sigma, \vec{f} \rangle\right]$$
where we are trying to maximize the correlation between a random vector $\sigma$ and $\vec{f}$. Regardless of interpretation, the key point here is that as $\hat{R}_S(\mathcal{F})$ increases, so too does the complexity of the function class $\mathcal{F}$.
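Both views describe the same inner quantity; here is a tiny sketch (all values are arbitrary stand-ins) checking that the correlation form and the partition form agree:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8
f_vals = rng.normal(size=m)               # stand-in values f(x_1), ..., f(x_m)
sigma = rng.choice([-1.0, 1.0], size=m)   # one Rademacher vector

# Correlation view: inner product between sigma and the vector of function values.
corr_view = sigma @ f_vals / m

# Partition view: difference between the sigma = +1 group and the sigma = -1 group.
partition_view = (f_vals[sigma == 1.0].sum() - f_vals[sigma == -1.0].sum()) / m

print(np.isclose(corr_view, partition_view))  # True
```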
Examples
Example 8.1 Suppose we define a function $f$ with $f(0) = -1$ and $f(1) = 1$.
Then, let our function class $\mathcal{F} = \lbrace f \rbrace$, and let our training sample $S = \lbrace (0, 1), (1, -1) \rbrace$. The Empirical Rademacher Complexity is therefore
$$\begin{align*}
\hat{R}_S(\mathcal{F}) &= \frac{1}{4} \sum_{\sigma} \left( \frac{1}{2} \sum_{i = 1}^2 \sigma_i f(x_i) \right) && \text{sup is gone as we consider only one function} \\\
&= \frac{1}{8} ((f(0) + f(1)) + (f(0) - f(1)) + (-f(0) + f(1)) + (-f(0) - f(1))) && \text{all four sign vectors } \sigma \\\
&= \frac{1}{8} (0 + (-2) + 2 + 0) = 0
\end{align*}$$
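As a sanity check of this arithmetic, here is a small sketch (using only the values $f(0) = -1$, $f(1) = 1$ from the example) that enumerates all four sign vectors directly:

```python
from itertools import product

f = {0: -1, 1: 1}   # values of f on the two sample points
xs = [0, 1]
m = len(xs)

# Exact expectation over sigma: average over all 2^m equally likely sign vectors.
total = 0.0
for sigma in product([-1, 1], repeat=m):
    total += sum(s * f[x] for s, x in zip(sigma, xs)) / m
print(total / 2 ** m)   # 0.0, matching the computation above
```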
Example 8.2 Now suppose we define a second function $g$ on the same sample points.
Then, let our function class $\mathcal{G} = \lbrace f, g \rbrace$ while keeping our training sample $S$ the same. This means we get to choose the function inside the supremum separately for each $\sigma$, which can only increase the Empirical Rademacher Complexity.
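The worked computation for $\mathcal{G}$ from the lecture is not reproduced here, so the sketch below is illustrative only: it hypothetically takes $g = -f$ on the sample points (the actual $g$ may differ) to show how the supremum over two functions can make the empirical complexity strictly positive.

```python
from itertools import product

f = {0: -1, 1: 1}
g = {0: 1, 1: -1}   # hypothetical g = -f, chosen only for illustration
xs = [0, 1]
m = len(xs)

total = 0.0
for sigma in product([-1, 1], repeat=m):
    corr_f = sum(s * f[x] for s, x in zip(sigma, xs)) / m
    corr_g = sum(s * g[x] for s, x in zip(sigma, xs)) / m
    total += max(corr_f, corr_g)   # the sup picks whichever function fits sigma better
print(total / 2 ** m)   # 0.5 under this choice of g, versus 0 for {f} alone
```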
Remark 8.1 Jensen's Inequality states that for convex $\phi$, $\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)]$. In the same way, for concave $\psi$, $\mathbb{E}[\psi(X)] \le \psi(\mathbb{E}[X])$. This is what we used above for the concave square root.
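A quick numerical illustration of the concave case (the distribution is arbitrary):

```python
import numpy as np

# E[sqrt(X)] <= sqrt(E[X]) for the concave square root.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 4.0, size=100_000)
print(np.sqrt(x).mean())   # about 1.33
print(np.sqrt(x.mean()))   # about 1.41, which is larger
```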
We notice that because the $\sigma_i$ are independent Rademacher random variables, when $i \ne j$, $\underset{\sigma}{\mathbb{E}}[\sigma_i \sigma_j] = \underset{\sigma}{\mathbb{E}}[\sigma_i]\underset{\sigma}{\mathbb{E}}[\sigma_j] = 0$, and when $i = j$, $\underset{\sigma}{\mathbb{E}}[\sigma_i \sigma_j] = \underset{\sigma}{\mathbb{E}}[\sigma_i^2] = 1$. Therefore, all cross terms vanish in expectation, and

$$\underset{\sigma}{\mathbb{E}}\left[\left(\sum_{i=1}^m \sigma_i f(x_i)\right)^2\right] = \sum_{i=1}^m f(x_i)^2.$$