CS7545_Sp24_Lecture_09: Rademacher Complexity
CS 7545: Machine Learning Theory -- Spring 2024
Instructor: Tyler LaBonte
Notes for Lecture 09
February 13, 2024
Scribes: Vedaant Shah, Jerry Xiong
Rademacher Complexity: Setting And Review
Recall that in our setting for defining Rademacher complexity from last time, we are working with an input space $X$, a distribution $D$ over $X$, and a function class $\mathcal{F}\subseteq \{f:X\to\mathbb{R}\}$. One way (not mentioned last time) to interpret $\mathcal{F}$ is as a class of loss functions, in which case Rademacher complexity quantifies how complex our loss function class is.
A Rademacher random variable $\sigma_i$ takes values $1$ and $-1$ with probability $\frac{1}{2}$ each. A Rademacher random vector is $\sigma=(\sigma_1,\ldots,\sigma_m)$ where each $\sigma_i$ is an independent Rademacher random variable. From here, we say the empirical Rademacher complexity of $\mathcal{F}$ with respect to some i.i.d. sample $S=\{x_1,\ldots,x_m\}\sim D^m$ is
$$\hat{R}_S(\mathcal{F})=\mathbb{E}_{\sigma} \left[\sup_{f \in \mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]$$ Likewise, we define the general Rademacher complexity of $\mathcal{F}$ on $X$ as $R(\mathcal{F})=\mathbb{E}_{S\sim D^m}[\hat{R}_S(\mathcal{F})]$.
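Since the expectation over $\sigma$ is a uniform average over all $2^m$ sign vectors, the empirical Rademacher complexity of a finite function class can be computed exactly by brute force. Below is a minimal Python sketch of this computation (our own illustration, not lecture code; the helper name `empirical_rademacher` is hypothetical), which is only feasible for small $m$:

```python
from itertools import product

def empirical_rademacher(functions, sample):
    """Exact R_hat_S(F) for a finite class, by enumerating all sign vectors."""
    m = len(sample)
    total = 0.0
    # each of the 2^m sign vectors sigma occurs with probability 2^-m
    for sigma in product([-1, 1], repeat=m):
        # inner sup over the finite class of the signed empirical average
        total += max(
            sum(s * f(x) for s, x in zip(sigma, sample)) / m
            for f in functions
        )
    return total / 2 ** m
```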
Today, we will use this same setting to analyze some examples and properties of Rademacher complexity.
Rademacher Complexity: Basic Examples
Now, let us consider some basic examples of Rademacher complexity.
As the first example, let $X\subseteq \mathbb{R}$, $S=\{0,1\}$, and $\mathcal{F}=\{f\}$ where $f(x)$ is defined as $1$ when $x\geq 1$ and $-1$ otherwise. We have $$\hat{R}_S(\mathcal{F})=\mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]=\frac{1}{2}\mathbb{E}_{\sigma}\left[\sum_{i=1}^2 \sigma_i f(x_i)\right]$$ since $m=2$ and $\mathcal{F}$ only contains $f$, allowing us to remove the sup. From here, note that $\sigma$ can be either $(-1, -1), (-1, 1), (1, -1), (1, 1)$, each with probability $\frac{1}{4}$, allowing us to expand the expectation to get
$$\frac{1}{2}\left(\frac{1}{4}\big(-f(0)-f(1)\big)+\frac{1}{4}\big(-f(0)+f(1)\big)+\frac{1}{4}\big(f(0)-f(1)\big)+\frac{1}{4}\big(f(0)+f(1)\big)\right)$$
$$=\frac{1}{8}\big((1-1)+(1+1)+(-1-1)+(-1+1)\big)=\frac{1}{8}(0+2-2+0)=\frac{1}{8}(0)=0$$
As a remark, it can also be shown that the Rademacher complexity will always be $0$ if $\mathcal{F}$ contains only one function, regardless of what that function is: with no sup to take, linearity of expectation and $\mathbb{E}[\sigma_i]=0$ make the expectation vanish. Additionally, the Rademacher complexity is always lower bounded by $0$, since $\mathbb{E}_{\sigma}[\sup_{f\in\mathcal{F}}\cdot]\geq \sup_{f\in\mathcal{F}}\mathbb{E}_{\sigma}[\cdot]=0$; this lower bound is thus achieved when $\mathcal{F}$ contains only one function.
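Using the hypothetical `empirical_rademacher` helper sketched above, we can confirm the hand computation for this single-function class:

```python
# f(x) = 1 if x >= 1 else -1, evaluated on the sample S = {0, 1}
f = lambda x: 1 if x >= 1 else -1
print(empirical_rademacher([f], [0, 1]))  # prints 0.0, matching the computation above
```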
As another example, consider a function $g$ where $g(x)=1$ when $x\geq 0$ and $-1$ otherwise. Now, let $\mathcal{G}=\{f, g\}$ and $S=\{0,1\}$ as before.
We have:
$$\hat{R}_S(\mathcal{G})=\mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{G}}\frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\right]=\frac{1}{8}\sum_{\sigma}\sup_{h\in \mathcal{G}} \sum_{i=1}^2 \sigma_i h(x_i)$$ where the $\frac{1}{8}$ comes from the same logic as above, since $m=2$ and each choice of $\sigma$ has probability $\frac{1}{4}$. From here, using $f(0)=-1$, $f(1)=1$, $g(0)=1$, and $g(1)=1$, note that:
$\sigma=(-1,-1)$: $\sup_{h\in\mathcal{G}}\sum_i \sigma_i h(x_i)=\max(0,-2)=0$
$\sigma=(-1,1)$: $\max(2,0)=2$
$\sigma=(1,-1)$: $\max(-2,0)=0$
$\sigma=(1,1)$: $\max(0,2)=2$
Substituting into the above, we thus get $$\hat{R}_S(\mathcal{G})= \frac{1}{8}(0+2+0+2)=\frac{1}{2}$$
As another remark, note that we increased the size of our function class by adding $g$ to $\mathcal{F}$ to get $\mathcal{G}$ and observed an increase in the Rademacher complexity. This makes intuitive sense: inside the expectation, we take the sup over a superset of $\mathcal{F}$, so we can always produce a value at least as large as before.
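The same helper (together with `f` from the previous snippet) confirms the value for $\mathcal{G}$:

```python
# g(x) = 1 if x >= 0 else -1; the class G = {f, g} on the sample S = {0, 1}
g = lambda x: 1 if x >= 0 else -1
print(empirical_rademacher([f, g], [0, 1]))  # prints 0.5, matching the calculation
```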
Rademacher Complexity: Linear Classifiers
Suppose the function class is the space of linear classifiers:
$$\mathcal{F} = \{ x \mapsto \langle w, x \rangle : \| w \|_2 \le B\}$$
and $S$ is
$$S = \{x_1, \ldots, x_m\}, \quad \text{where } \|x_i\|_2 \le C \text{ for all } i.$$
Then, the empirical Rademacher complexity is bounded by
$$\hat{R}_S(\mathcal{F}) \le \frac{BC}{\sqrt m}.$$
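This bound is easy to probe numerically: for any fixed $\sigma$, the inner sup has the closed form $\sup_{\|w\|_2\le B}\langle w, \sum_i \sigma_i x_i\rangle = B\|\sum_i \sigma_i x_i\|_2$, so a Monte Carlo average over random sign vectors estimates $\hat{R}_S(\mathcal{F})$. Here is a hedged numpy sketch (our own illustration; the values of $m$, $n$, $B$, $C$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, B, C = 200, 10, 2.0, 1.0

# sample points with ||x_i||_2 = C by normalizing Gaussian draws
X = rng.standard_normal((m, n))
X = C * X / np.linalg.norm(X, axis=1, keepdims=True)

# R_hat_S(F) = (B/m) E_sigma ||sum_i sigma_i x_i||_2; estimate over random signs
sigma = rng.choice([-1.0, 1.0], size=(10_000, m))
estimate = (B / m) * np.linalg.norm(sigma @ X, axis=1).mean()
print(estimate, B * C / np.sqrt(m))  # the estimate should land below the bound
```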
Although we have seen some examples, you may ask why Rademacher complexity is useful for quantifying the "complexity" of function classes. Below are the two main reasons we choose to use Rademacher complexity:
1. It is easy to work with and has nice properties (see the properties below).
2. It lets you get a nearly tight bound on uniform deviations, even when $\mathcal{F}$ is not finite (we will see this in future lectures).
Rademacher Complexity for Vectors
A more general way to define Rademacher complexity is to consider a set of vectors $A\subseteq \mathbb{R}^m$, where the elements $a\in A$ are vectors of the form $a=(a_1,\ldots,a_m)\in\mathbb{R}^m$. We then define the Rademacher complexity of $A$ as
$$R(A)=\mathbb{E}_{\sigma}[\sup_{a\in A}\frac{1}{m}\sum_{i=1}^m \sigma_i a_i]$$
Note that this definition generalizes the Rademacher complexity for functions: given a sample $S\sim D^m$ and a function class $\mathcal{F}$, we can define the set of vectors $A=\{(f(x_1),\ldots,f(x_m)) : f\in\mathcal{F}\}$. Then $R(A)= \hat{R}_S(\mathcal{F})$, so the vector definition is indeed more general. As such, we state the properties of Rademacher complexity below in the vector form, noting that by this reasoning they still hold in the function form.
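The brute-force computation carries over directly to the vector form. A short sketch (again our own, reusing `f` and `g` from the earlier snippets; the helper name `rademacher_vectors` is hypothetical):

```python
from itertools import product

def rademacher_vectors(A):
    """Exact R(A) for a finite set A of m-dimensional vectors."""
    m = len(A[0])
    total = 0.0
    for sigma in product([-1, 1], repeat=m):
        total += max(sum(s * a_i for s, a_i in zip(sigma, a)) / m for a in A)
    return total / 2 ** m

# the reduction in the text: collect the evaluation vectors (h(x_1), ..., h(x_m))
A = [tuple(h(x) for x in [0, 1]) for h in [f, g]]
print(rademacher_vectors(A))  # prints 0.5 = R_hat_S(G), as claimed
```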
Basic Properties
Here are a few properties of Rademacher complexity. While we do not prove them here, it is a good exercise to do so. (A numerical spot-check of some of these properties appears after the list.)
$R(cA)=|c|R(A)$
If $A'\subseteq A$, then $R(A')\leq R(A)$
Define the Minkowski Sum of two sets of vectors $A, A'$ as $A+A'=\{a+a'|a\in A, a'\in A'\}$. We have $R(A+A')=R(A)+R(A')$
For a set of vectors $A$, define the convex hull of $A$ as $$\text{conv}(A)=\left\{\sum_{i=1}^{|A|} \lambda_i a^{(i)} \,\middle|\, a^{(i)}\in A,\ \lambda_i\geq0,\ \sum_{i=1}^{|A|} \lambda_i = 1\right\}$$ We have that $R(\text{conv}(A))=R(A)$.
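As promised, here is a numerical spot-check of the scaling and Minkowski-sum properties, reusing the hypothetical `rademacher_vectors` helper on two small arbitrary sets:

```python
A1 = [(1.0, -1.0), (0.5, 2.0)]
A2 = [(0.0, 1.0), (-1.0, -1.0)]

# R(cA) = |c| R(A); holds even for negative c since the signs are symmetric
c = -3.0
cA1 = [tuple(c * a_i for a_i in a) for a in A1]
assert abs(rademacher_vectors(cA1) - abs(c) * rademacher_vectors(A1)) < 1e-12

# R(A + A') = R(A) + R(A') for the Minkowski sum
A_sum = [tuple(x + y for x, y in zip(a, b)) for a in A1 for b in A2]
assert abs(rademacher_vectors(A_sum)
           - (rademacher_vectors(A1) + rademacher_vectors(A2))) < 1e-12
```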
Key Lemmas
In addition to the basic properties above, we have 2 key lemmas regarding Rademacher complexity. These are more complex and difficult to prove.
Massart's Finite Class Lemma: If $A$ is finite, then $$R(A)\leq \max_{a\in A} \|a\|_2 \frac{\sqrt{2\log |A|}}{m}$$
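A quick check of Massart's lemma on the small set `A1` from the previous snippet (here $m=2$):

```python
import math

max_norm = max(math.hypot(*a) for a in A1)                 # max ||a||_2 over A1
massart = max_norm * math.sqrt(2 * math.log(len(A1))) / 2  # divide by m = 2
assert rademacher_vectors(A1) <= massart                   # 0.75 <= ~1.21
```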
Talagrand's Contraction Lemma: Recall that we say a function $\Phi$ is $L$-Lipschitz when $\|\Phi(x)-\Phi(y)\|\leq L\|x-y\|$ for all $x, y\in\text{dom}(\Phi)$. Now, if $\Phi$ is from $\mathbb{R}\to\mathbb{R}$ and we have a set of vectors $A\subseteq \mathbb{R}^m$, define $\Phi(A)=\{(\Phi(a_1),\ldots,\Phi(a_m)) \mid a\in A\}$, i.e., $\Phi$ applied coordinate-wise. It can be shown that $$R(\Phi(A))\leq L\cdot R(A)$$
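And a check of the contraction lemma with the $1$-Lipschitz map $\Phi(t)=|t|$ applied coordinate-wise to `A1`:

```python
# phi(t) = |t| is 1-Lipschitz, so R(phi(A1)) <= R(A1) should hold
phi_A1 = [tuple(abs(a_i) for a_i in a) for a in A1]
assert rademacher_vectors(phi_A1) <= 1.0 * rademacher_vectors(A1)  # 0.25 <= 0.75
```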
Homework
In this problem, we will prove a classical bound on the Rademacher complexity of neural networks. Suppose the input space is $\mathcal{X} = \mathbb{R}^n$ and we have a training set $S=\{(x_i,y_i)\}_{i=1}^m$. Let $\phi:\mathbb{R}\to\mathbb{R}$ be an $L$-Lipschitz activation function such that $\phi(0)=0$ (e.g., the ReLU function). Define the class of neural networks of depth $2\leq j \leq D$ and width $H$ with $\ell_1$-bounded weights recursively as
$$\mathcal{F}_j \coloneqq \left\{ x\mapsto \sum_{k=1}^H w_k \phi(f_k(x)): f_k\in \mathcal{F}_{j-1}, \|w\|_1\leq B_j \right\}.$$
(1) Graded. Define $\mathcal{F}_1 \coloneqq \left\{ x\mapsto \langle w, x \rangle : \|w\|_1\leq B_1 \right\}$ and suppose $\|x_i\|_\infty\leq C$ for all $1\leq i \leq m$. Prove that
$$\widehat{\mathfrak{R}}_S(\mathcal{F}_1) \leq B_1C \sqrt{\frac{2\log(2n)}{m}}.$$
Hint. Use Hölder's inequality and Massart's finite class lemma.
(2) Graded. Prove that $\widehat{\mathfrak{R}}_S(\mathcal{F}_j) \leq 2LB_j \widehat{\mathfrak{R}}_S(\mathcal{F}_{j-1})$ for $2 \leq j \leq D$. Hint. Use Hölder's inequality and Talagrand's contraction lemma. You may use part (4) without proof.
(3) Graded. Use parts (1) and (2) to show an upper bound on $\widehat{\mathfrak{R}}_S(\mathcal{F}_D)$. (You must use parts (1) and (2).)
(4) Ungraded, optional. Prove that if a function class $\mathcal{G}$ contains the zero function, then
$$\mathbb{E}_\sigma \left[\sup_{g\in\mathcal{G}} \frac{1}{m} \left| \sum_{i=1}^m \sigma_i g(x_i) \right|\right] \leq 2\widehat{\mathfrak{R}}_S(\mathcal{G}).$$