CS7545_Sp24_Lecture_09: Rademacher Complexity
CS 7545: Machine Learning Theory -- Spring 2024
Instructor: Tyler LaBonte
Notes for Lecture 09
February 13, 2024
Scribes: Vedaant Shah, Jerry Xiong
Rademacher Complexity: Setting And Review
Recall that in our setting for defining Rademacher complexity from last time, we are working with an input space $X$, a distribution $D$ over $X$, and a function class $\mathcal{F}\subseteq \{f:X\to\mathbb{R}\}$. One way (not mentioned last time) to interpret $\mathcal{F}$ is as a class of loss functions, in which case Rademacher complexity quantifies how complex our loss function class is.
A Rademacher random variable $\sigma_i$ takes values $1$ and $-1$ with probability $\frac{1}{2}$ each. A Rademacher random vector is $\sigma=(\sigma_1,\ldots,\sigma_m)$ where each $\sigma_i$ is an independent Rademacher random variable. From here, we say the empirical Rademacher complexity of $\mathcal{F}$ with respect to some i.i.d. sample $S=\{x_1,\ldots,x_m\}\sim D^m$ is
$$\hat{R}_S(\mathcal{F})=\mathbb{E}_{\sigma} \left[\sup_{f \in \mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]$$ Likewise, we define the general Rademacher complexity of $\mathcal{F}$ on $X$ as $R(\mathcal{F})=\mathbb{E}_{S\sim D^m}[\hat{R}_S(\mathcal{F})]$.
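Since the expectation over $\sigma$ is a uniform average over all $2^m$ sign vectors, the empirical Rademacher complexity of a finite function class can be computed exactly by brute force. Below is a minimal Python sketch of this computation (our own illustration, not lecture code; the helper name `empirical_rademacher` is hypothetical), which is only feasible for small $m$:

```python
from itertools import product

def empirical_rademacher(functions, sample):
    """Exact R_hat_S(F) for a finite class, by enumerating all sign vectors."""
    m = len(sample)
    total = 0.0
    # each of the 2^m sign vectors sigma occurs with probability 2^-m
    for sigma in product([-1, 1], repeat=m):
        # inner sup over the finite class of the signed empirical average
        total += max(
            sum(s * f(x) for s, x in zip(sigma, sample)) / m
            for f in functions
        )
    return total / 2 ** m
```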
Today, we will use this same setting to analyze some examples and properties of Rademacher complexity.
Rademacher Complexity: Basic Examples
Now, let us consider some basic examples of Rademacher complexity.
As the first example, let $X\subseteq \mathbb{R}$, $S=\{0,1\}$, and $\mathcal{F}=\{f\}$ where $f(x)$ is defined as $1$ when $x\geq 1$ and $-1$ otherwise. We have $$\hat{R}_S(\mathcal{F})=\mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]=\frac{1}{2}\mathbb{E}_{\sigma}\left[\sum_{i=1}^2 \sigma_i f(x_i)\right]$$ since $m=2$ and $\mathcal{F}$ only contains $f$, allowing us to remove the sup. From here, note that $\sigma$ can be either $(-1, -1), (-1, 1), (1, -1), (1, 1)$, each with probability $\frac{1}{4}$, allowing us to expand the expectation to get
$$\frac{1}{2}\left(\frac{1}{4}\big(-f(0)-f(1)\big)+\frac{1}{4}\big(-f(0)+f(1)\big)+\frac{1}{4}\big(f(0)-f(1)\big)+\frac{1}{4}\big(f(0)+f(1)\big)\right)$$
$$=\frac{1}{8}\big((1-1)+(1+1)+(-1-1)+(-1+1)\big)=\frac{1}{8}(0+2-2+0)=\frac{1}{8}(0)=0$$
As a remark, it can also be shown that the Rademacher complexity will always be $0$ if $\mathcal{F}$ contains only one function, regardless of what that function is: with no sup to take, linearity of expectation and $\mathbb{E}[\sigma_i]=0$ make the expectation vanish. Additionally, the Rademacher complexity is always lower bounded by $0$, since $\mathbb{E}_{\sigma}[\sup_{f\in\mathcal{F}}\cdot]\geq \sup_{f\in\mathcal{F}}\mathbb{E}_{\sigma}[\cdot]=0$; this lower bound is thus achieved when $\mathcal{F}$ contains only one function.
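Using the hypothetical `empirical_rademacher` helper sketched above, we can confirm the hand computation for this single-function class:

```python
# f(x) = 1 if x >= 1 else -1, evaluated on the sample S = {0, 1}
f = lambda x: 1 if x >= 1 else -1
print(empirical_rademacher([f], [0, 1]))  # prints 0.0, matching the computation above
```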
As another example, consider a function $g$ where $g(x)=1$ when $x\geq 0$ and $-1$ otherwise. Now, let $\mathcal{G}=\{f, g\}$ and $S=\{0,1\}$ as before.
We have:
$$\hat{R}_S(\mathcal{G})=\mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{G}}\frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\right]=\frac{1}{8}\sum_{\sigma}\sup_{h\in \mathcal{G}} \sum_{i=1}^2 \sigma_i h(x_i)$$ where the $\frac{1}{8}$ comes from the same logic as above, since $m=2$ and each choice of $\sigma$ has probability $\frac{1}{4}$. From here, using $f(0)=-1$, $f(1)=1$, $g(0)=1$, and $g(1)=1$, note that:
$\sigma=(-1,-1)$: $\sup_{h\in\mathcal{G}}\sum_i \sigma_i h(x_i)=\max(0,-2)=0$
$\sigma=(-1,1)$: $\max(2,0)=2$
$\sigma=(1,-1)$: $\max(-2,0)=0$
$\sigma=(1,1)$: $\max(0,2)=2$
Substituting into the above, we thus get $$\hat{R}_S(\mathcal{G})= \frac{1}{8}(0+2+0+2)=\frac{1}{2}$$
As another remark, note that we increased the size of our function class by adding $g$ to $\mathcal{F}$ to get $\mathcal{G}$ and observed an increase in the Rademacher complexity. This makes intuitive sense: inside the expectation, we take the sup over a superset of $\mathcal{F}$, so we can always produce a value at least as large as before.
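The same helper (together with `f` from the previous snippet) confirms the value for $\mathcal{G}$:

```python
# g(x) = 1 if x >= 0 else -1; the class G = {f, g} on the sample S = {0, 1}
g = lambda x: 1 if x >= 0 else -1
print(empirical_rademacher([f, g], [0, 1]))  # prints 0.5, matching the calculation
```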
Rademacher Complexity: Linear Classifiers
Suppose the function class is the space of linear classifiers:
$$\mathcal{F} = \{ x \mapsto \langle w, x \rangle : \| w \|_2 \le B\}$$
and $S$ is
$$S = \{x_1, \ldots, x_m\}, \quad \text{where } \|x_i\|_2 \le C \text{ for all } i.$$
Then, the empirical Rademacher complexity is bounded by
$$\hat{R}_S(\mathcal{F}) \le \frac{BC}{\sqrt m}.$$
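This bound is easy to probe numerically: for any fixed $\sigma$, the inner sup has the closed form $\sup_{\|w\|_2\le B}\langle w, \sum_i \sigma_i x_i\rangle = B\|\sum_i \sigma_i x_i\|_2$, so a Monte Carlo average over random sign vectors estimates $\hat{R}_S(\mathcal{F})$. Here is a hedged numpy sketch (our own illustration; the values of $m$, $n$, $B$, $C$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, B, C = 200, 10, 2.0, 1.0

# sample points with ||x_i||_2 = C by normalizing Gaussian draws
X = rng.standard_normal((m, n))
X = C * X / np.linalg.norm(X, axis=1, keepdims=True)

# R_hat_S(F) = (B/m) E_sigma ||sum_i sigma_i x_i||_2; estimate over random signs
sigma = rng.choice([-1.0, 1.0], size=(10_000, m))
estimate = (B / m) * np.linalg.norm(sigma @ X, axis=1).mean()
print(estimate, B * C / np.sqrt(m))  # the estimate should land below the bound
```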
Although we have seen some examples, you may ask why Rademacher complexity is useful for quantifying the "complexity" of function classes. Below are the two main reasons we choose to use Rademacher complexity:
1. It is easy to work with and has nice properties (see the properties below).
2. It lets you get a nearly tight bound on uniform deviations, even when $\mathcal{F}$ is not finite (we will see this in future lectures).
Rademacher Complexity for Vectors
A more general way to define Rademacher complexity is to consider a set of vectors $A\subseteq \mathbb{R}^m$, where the elements $a\in A$ are vectors of the form $a=(a_1,\ldots,a_m)\in\mathbb{R}^m$. We then define the Rademacher complexity of $A$ as
$$R(A)=\mathbb{E}_{\sigma}[\sup_{a\in A}\frac{1}{m}\sum_{i=1}^m \sigma_i a_i]$$
Note that this definition generalizes the Rademacher complexity for functions: given a sample $S\sim D^m$ and a function class $\mathcal{F}$, we can define the set of vectors $A=\{(f(x_1),\ldots,f(x_m)) : f\in\mathcal{F}\}$. Then $R(A)= \hat{R}_S(\mathcal{F})$, so the vector definition is indeed more general. As such, we state the properties of Rademacher complexity below in the vector form, noting that by this reasoning they still hold in the function form.
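The brute-force computation carries over directly to the vector form. A short sketch (again our own, reusing `f` and `g` from the earlier snippets; the helper name `rademacher_vectors` is hypothetical):

```python
from itertools import product

def rademacher_vectors(A):
    """Exact R(A) for a finite set A of m-dimensional vectors."""
    m = len(A[0])
    total = 0.0
    for sigma in product([-1, 1], repeat=m):
        total += max(sum(s * a_i for s, a_i in zip(sigma, a)) / m for a in A)
    return total / 2 ** m

# the reduction in the text: collect the evaluation vectors (h(x_1), ..., h(x_m))
A = [tuple(h(x) for x in [0, 1]) for h in [f, g]]
print(rademacher_vectors(A))  # prints 0.5 = R_hat_S(G), as claimed
```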
Basic Properties
Here are a few properties of Rademacher complexity. While we do not prove them here, it is a good exercise to do so. (A numerical spot-check of some of these properties appears after the list.)
$R(cA)=|c|R(A)$
If $A'\subseteq A$, then $R(A')\leq R(A)$
Define the Minkowski Sum of two sets of vectors $A, A'$ as $A+A'=\{a+a'|a\in A, a'\in A'\}$. We have $R(A+A')=R(A)+R(A')$
For a set of vectors $A$, define the convex hull of $A$ as $$\text{conv}(A)=\left\{\sum_{i=1}^{|A|} \lambda_i a^{(i)} \,\middle|\, a^{(i)}\in A,\ \lambda_i\geq0,\ \sum_{i=1}^{|A|} \lambda_i = 1\right\}$$ We have that $R(\text{conv}(A))=R(A)$.
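As promised, here is a numerical spot-check of the scaling and Minkowski-sum properties, reusing the hypothetical `rademacher_vectors` helper on two small arbitrary sets:

```python
A1 = [(1.0, -1.0), (0.5, 2.0)]
A2 = [(0.0, 1.0), (-1.0, -1.0)]

# R(cA) = |c| R(A); holds even for negative c since the signs are symmetric
c = -3.0
cA1 = [tuple(c * a_i for a_i in a) for a in A1]
assert abs(rademacher_vectors(cA1) - abs(c) * rademacher_vectors(A1)) < 1e-12

# R(A + A') = R(A) + R(A') for the Minkowski sum
A_sum = [tuple(x + y for x, y in zip(a, b)) for a in A1 for b in A2]
assert abs(rademacher_vectors(A_sum)
           - (rademacher_vectors(A1) + rademacher_vectors(A2))) < 1e-12
```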
Key Lemmas
In addition to the basic properties above, we have 2 key lemmas regarding Rademacher complexity. These are more complex and difficult to prove.
Massart's Finite Class Lemma: If $A$ is finite, then $$R(A)\leq \max_{a\in A} \|a\|_2 \frac{\sqrt{2\log |A|}}{m}$$
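A quick check of Massart's lemma on the small set `A1` from the previous snippet (here $m=2$):

```python
import math

max_norm = max(math.hypot(*a) for a in A1)                 # max ||a||_2 over A1
massart = max_norm * math.sqrt(2 * math.log(len(A1))) / 2  # divide by m = 2
assert rademacher_vectors(A1) <= massart                   # 0.75 <= ~1.21
```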
Talagrand's Contraction Lemma: Recall that we say a function $\Phi$ is $L$-Lipschitz when $\|\Phi(x)-\Phi(y)\|\leq L\|x-y\|$ for all $x, y\in\text{dom}(\Phi)$. Now, if $\Phi$ is from $\mathbb{R}\to\mathbb{R}$ and we have a set of vectors $A\subseteq \mathbb{R}^m$, define $\Phi(A)=\{(\Phi(a_1),\ldots,\Phi(a_m)) \mid a\in A\}$, i.e., $\Phi$ applied coordinate-wise. It can be shown that $$R(\Phi(A))\leq L\cdot R(A)$$
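And a check of the contraction lemma with the $1$-Lipschitz map $\Phi(t)=|t|$ applied coordinate-wise to `A1`:

```python
# phi(t) = |t| is 1-Lipschitz, so R(phi(A1)) <= R(A1) should hold
phi_A1 = [tuple(abs(a_i) for a_i in a) for a in A1]
assert rademacher_vectors(phi_A1) <= 1.0 * rademacher_vectors(A1)  # 0.25 <= 0.75
```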
Homework
In this problem, we will prove a classical bound on the Rademacher complexity of neural networks. Suppose the input space is $\mathcal{X} = \mathbb{R}^n$ and we have a training set $S=\{(x_i,y_i)\}_{i=1}^m$. Let $\phi:\mathbb{R}\to\mathbb{R}$ be an $L$-Lipschitz activation function such that $\phi(0)=0$ (e.g., the ReLU function). Define the class of neural networks of depth $2\leq j \leq D$ and width $H$ with $\ell_1$-bounded weights recursively as
$$\mathcal{F}_j \coloneqq \left\{ x\mapsto \sum_{k=1}^H w_k \phi(f_k(x)): f_k\in \mathcal{F}_{j-1}, \|w\|_1\leq B_j \right\}.$$
(1) Graded. Define $\mathcal{F}_1 \coloneqq \left\{ x\mapsto \langle w, x \rangle : \|w\|_1\leq B_1 \right\}$ and suppose $\|x_i\|_\infty\leq C$ for all $1\leq i \leq m$. Prove that
$$\widehat{\mathfrak{R}}_S(\mathcal{F}_1) \leq B_1C \sqrt{\frac{2\log(2n)}{m}}.$$
Hint. Use Hölder's inequality and Massart's finite class lemma.
(2) Graded. Prove that $\widehat{\mathfrak{R}}_S(\mathcal{F}_j) \leq 2LB_j \widehat{\mathfrak{R}}_S(\mathcal{F}_{j-1})$ for $2 \leq j \leq D$. Hint. Use Hölder's inequality and Talagrand's contraction lemma. You may use part (4) without proof.
(3) Graded. Use parts (1) and (2) to show an upper bound on $\widehat{\mathfrak{R}}_S(\mathcal{F}_D)$. (You must use parts (1) and (2).)
(4) Ungraded, optional. Prove that if a function class $\mathcal{G}$ contains the zero function, then
$$\mathbb{E}_\sigma \left[\sup_{g\in\mathcal{G}} \frac{1}{m} \left| \sum_{i=1}^m \sigma_i g(x_i) \right|\right] \leq 2\widehat{\mathfrak{R}}_S(\mathcal{G}).$$