CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Tyler Labonte
Scribes: Yitong Li, Sachit Kuhar
Let $F=\{f: X \rightarrow \mathbb{R}\}$ be a function class and let $S=\{(x_i, y_i)\}_{i=1}^{m} \sim D^m$ be a sample. We assume the $\sigma_i$ are independent Rademacher random variables, taking values in $\{+1,-1\}$ with equal probability.
Recall from Lecture 08 that the empirical Rademacher complexity is
$$\hat{\mathbb{R}}_S(F)=\underset{\sigma}{\mathbb{E}}\left[\sup _{f \in F} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i)\right]$$
Definition. For a set of vectors $A \subseteq \mathbb{R}^m$ with elements $a=\left(a_1, a_2, \ldots, a_m\right)$, the Rademacher complexity of $A$ is:
$$R(A)=\underset{\sigma}{E}\left[\sup _{a \in A} \frac{1}{m} \sum_{i=1}^m \sigma_i a_i\right]$$
Setting $A=\{(f(x_1), \ldots, f(x_m)): f \in F\}$ recovers $\hat{\mathbb{R}}_S(F)=R(A)$.
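As a quick sanity check, $R(A)$ for a finite set $A$ can be estimated by Monte Carlo over draws of $\sigma$. This is an illustrative sketch (the estimator and the example set below are my own choices, not from the lecture):

```python
import numpy as np

def empirical_rademacher(A, n_trials=200_000, seed=0):
    """Monte Carlo estimate of R(A) = E_sigma[ sup_{a in A} (1/m) sum_i sigma_i a_i ].

    A: array of shape (k, m) -- k vectors in R^m,
       e.g. rows (f(x_1), ..., f(x_m)) for f in a finite class F.
    """
    A = np.asarray(A, dtype=float)
    k, m = A.shape
    rng = np.random.default_rng(seed)
    # n_trials independent draws of m Rademacher signs in {-1, +1}
    sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
    # for each draw, take sup over a in A of (1/m) <sigma, a>, then average
    sups = (sigma @ A.T).max(axis=1) / m
    return sups.mean()

# Example: A = {(1, 1), (1, -1)} in R^2. Here sup_a <sigma, a> =
# sigma_1 + |sigma_2| = sigma_1 + 1, so the exact value is 1/2.
est = empirical_rademacher([[1.0, 1.0], [1.0, -1.0]])
```

Averaging the per-draw suprema is exactly the outer expectation in the definition, so the estimate converges to $R(A)$ as the number of trials grows.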
Why Rademacher complexity?
1. It enjoys nice properties and lemmas (a ``Rademacher calculus'').
2. It yields a nearly tight bound on the generalization gap
$\sup _{f \in F}\left|L(f)-\hat{L}_S(f)\right|$.
Properties of Rademacher complexity:
1. $R(cA)=|c| R(A)$, where $cA=\{c a : a \in A\}$ for $c \in \mathbb{R}$.
2. $A^{\prime} \subseteq A \Rightarrow R(A^{\prime}) \leq R(A)$.
3. $R(A+A^{\prime})=R(A)+R(A^{\prime})$, where $A+A^{\prime}=\{a+a^{\prime}: a \in A, a^{\prime} \in A^{\prime}\}$ is the Minkowski sum.
4. $R(\operatorname{Conv}(A))=R(A)$, where the convex hull is $\operatorname{Conv}(A)=\left\{\sum_i w_i a_i : a_i \in A,\ \sum_i w_i=1,\ w_i \geq 0\right\}$.
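A minimal numerical sanity check of Properties 1 and 2, assuming a Monte Carlo estimator like the one above (the random set $A$ and the tolerance are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
# shared Rademacher draws, reused for every set so the comparisons are fair
sigma = rng.choice([-1.0, 1.0], size=(100_000, m))

def R(B):
    # Monte Carlo estimate of E_sigma[ sup_{b in B} (1/m) <sigma, b> ]
    return ((sigma @ np.asarray(B, float).T).max(axis=1) / m).mean()

A = rng.normal(size=(5, m))                   # five illustrative vectors in R^4
c = -3.0
ok1 = abs(R(c * A) - abs(c) * R(A)) < 0.05    # Property 1: R(cA) = |c| R(A)
ok2 = R(A[:2]) <= R(A)                        # Property 2: A' subset of A => R(A') <= R(A)
```

Property 2 holds exactly here because, for shared $\sigma$ draws, the supremum over a subset of rows is pointwise at most the supremum over all rows; Property 1 holds up to Monte Carlo error.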
Massart's Lemma: If $|A|<\infty$, then
$$R(A)\leq\max _{a \in A}\|a\|_2\frac{\sqrt{2 \log |A|}}{m}.$$
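A sketch verifying Massart's bound on a random finite set (the choices of $k=|A|$, $m$, and $A$ below are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 20, 10
A = rng.normal(size=(k, m))                       # |A| = 20 vectors in R^10

# Monte Carlo estimate of R(A)
sigma = rng.choice([-1.0, 1.0], size=(200_000, m))
R_est = ((sigma @ A.T).max(axis=1) / m).mean()

# Massart's bound: max_a ||a||_2 * sqrt(2 log|A|) / m
bound = np.linalg.norm(A, axis=1).max() * np.sqrt(2 * np.log(k)) / m
```

On typical random instances the estimate sits well below the bound, consistent with the bound being driven by the single largest-norm element of $A$.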
Talagrand's Contraction Lemma: If each $\phi_i: \mathbb{R} \rightarrow \mathbb{R}$ is $L$-Lipschitz, then
$$R(\phi(A)) \leq L \cdot R(A),$$
where $\phi(A)=\{\phi(a): a \in A\}$ and
$\phi(a)=(\phi_1(a_1), \phi_2(a_2), \ldots, \phi_m(a_m))$.
Here $\phi_i$ being $L$-Lipschitz on $\mathbb{R}$ means
$|\phi_i(x)-\phi_i(y)| \leq L|x-y|$ for all $x, y \in \mathbb{R}$.
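A sketch checking the contraction lemma numerically with $\phi_i = \tanh$ for every $i$ ($\tanh$ is $1$-Lipschitz since $|\tanh'(x)| \leq 1$; the random set $A$ is an illustrative choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 6
A = rng.normal(size=(8, m))                        # eight illustrative vectors in R^6
sigma = rng.choice([-1.0, 1.0], size=(200_000, m)) # shared draws for both sets

def R(B):
    # Monte Carlo estimate of E_sigma[ sup_{b in B} (1/m) <sigma, b> ]
    return ((sigma @ B.T).max(axis=1) / m).mean()

# L = 1 case of the lemma: R(phi(A)) <= R(A), with phi applied coordinatewise
contracted = R(np.tanh(A))
original = R(A)
```

Since $\tanh$ shrinks every coordinate toward zero, the gap between the two estimates is typically much larger than the Monte Carlo noise.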
Proof of Massart's Lemma. Fix $\lambda>0$, to be chosen later, and study $A^{\prime}=\{\lambda a : a \in A\}$. Then
$$\begin{align*}
m R(A^{\prime})&=\mathbb{E}_\sigma\left[\max _{a^{\prime} \in A^{\prime}} \sum_{i=1}^m \sigma_i a_i^{\prime}\right] \\
&=\mathbb{E}_\sigma\left[\log \left(\max _{a^{\prime} \in A^{\prime}} \exp \left(\sum_{i=1}^m \sigma_i a_i^{\prime}\right)\right)\right] \\
&\leq \mathbb{E}_\sigma\left[\log \left(\sum_{a^{\prime} \in A^{\prime}} \exp \left(\sum_{i=1}^m \sigma_i a_i^{\prime}\right)\right)\right] && \text{max of positive terms} \leq \text{their sum}
\end{align*}$$
By Jensen's inequality applied to the concave function $\log$, we have:
$$\begin{align*}
& \leq \log \left(\mathbb{E}_\sigma\left[\sum_{a^{\prime} \in A^{\prime}} \exp \left(\sum_{i=1}^m \sigma_i a_i^{\prime}\right)\right]\right) \\
& =\log \left( \mathbb{E}_\sigma\left[\sum_{a^{\prime} \in A^{\prime}} \prod_{i=1}^m \exp (\sigma_i a_i^{\prime})\right]\right) \\
& =\log \left( \sum_{a^{\prime} \in A^{\prime}} \prod_{i=1}^m \mathbb{E}_{\sigma_i}\left[\exp (\sigma_i a_i^{\prime})\right]\right) && \text{by independence of the } \sigma_i
\end{align*}$$
We know that
$$\mathbb{E}_{\sigma_i}\left[\exp \left(\sigma_i a_i^{\prime}\right)\right]=\frac{1}{2} \exp \left(a_i^{\prime}\right)+\frac{1}{2} \exp \left(-a_i^{\prime}\right) \leq \exp \left(\frac{a_i^{\prime 2}}{2}\right),$$
where the inequality is $\cosh x \leq e^{x^2 / 2}$ (compare the Taylor series term by term).
Hence,
$$\begin{align*}
m R\left(A^{\prime}\right)&\leq\log \left(\sum_{a^{\prime} \in A^{\prime}} \prod_{i=1}^m \exp \left(\frac{a_i^{\prime 2}}{2}\right)\right) \\
& =\log \left(\sum_{a^{\prime} \in A^{\prime}} \exp \left(\frac{\left\|a^{\prime}\right\|_2^2}{2}\right)\right) \\
& \leq \log \left(\left|A^{\prime}\right| \max _{a^{\prime} \in A^{\prime}} \exp \left(\frac{\left\|a^{\prime}\right\|_2^2}{2}\right)\right) \\
& =\log \left|A^{\prime}\right|+\log \left(\max _{a^{\prime} \in A^{\prime}} \exp \left(\frac{\left\|a^{\prime}\right\|_2^2}{2}\right)\right) \\
& =\log \left|A^{\prime}\right|+\max _{a^{\prime} \in A^{\prime}} \frac{\left\|a^{\prime}\right\|_2^2}{2}.
\end{align*}$$
By Property 1, we have $R(A) =\frac{1}{\lambda} R\left(A^{\prime}\right)$. Substituting $a^{\prime}=\lambda a$ (so that $\left|A^{\prime}\right|=|A|$ and $\left\|a^{\prime}\right\|_2^2=\lambda^2\|a\|_2^2$), we obtain
$$R(A) \leq \frac{\log |A|+\lambda^2 \max_{a \in A} \frac{\|a\|_2^2}{2}}{\lambda m}.$$
Picking $\lambda=\sqrt{\frac{2 \log |A|}{\max_{a \in A}\|a\|_2^2}}$, which minimizes the right-hand side, gives
$$R(A) \leq \max_{a \in A}\|a\|_2 \frac{\sqrt{2 \log |A|}}{m}.$$
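The choice of $\lambda$ can be double-checked numerically. This sketch, with illustrative values $|A|=20$, $M=\max_{a \in A}\|a\|_2=4$, $m=10$ (my own choices, not from the lecture), confirms that $\lambda^{\ast}=\sqrt{2\log|A|}/M$ minimizes the bound and attains the stated value:

```python
import numpy as np

# the bound (log|A| + lambda^2 M^2 / 2) / (lambda m) is convex for lambda > 0:
# it has the form a/lambda + b*lambda, minimized at lambda = sqrt(a/b)
k, M, m = 20, 4.0, 10
f = lambda lam: (np.log(k) + lam ** 2 * M ** 2 / 2) / (lam * m)

lam_star = np.sqrt(2 * np.log(k)) / M      # claimed minimizer
grid = np.linspace(0.01, 5.0, 2000)        # brute-force comparison grid

no_better = f(lam_star) <= f(grid).min() + 1e-12
matches = np.isclose(f(lam_star), M * np.sqrt(2 * np.log(k)) / m)
```

Writing the bound as $\frac{\log|A|}{\lambda m} + \frac{\lambda M^2}{2m}$ makes the convexity, and hence the closed-form minimizer, immediate.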
Proof of Talagrand's Contraction Lemma
It is sufficient to prove for $L=1$ . Define $\phi^{\prime}=\frac{1}{L} \phi$ . Then, we have $R(\phi(A)) =R(L \phi^{\prime}(A))$ . Use Property 1, it follows that
$$
\begin{align*}
R(\phi(A)) &=R(L \phi^{\prime}(A))\\
&=L R(\phi^{\prime}(A)) .
\end{align*}
$$
It is also sufficient to show the claim for $A_{1}$, the set in which $\phi$ is applied only to the first coordinate; the general case then follows by repeating the argument one coordinate at a time. That is, writing $\phi(a)=\left(\phi_{1}\left(a_{1}\right), \phi_{2}\left(a_{2}\right), \ldots, \phi_{m}\left(a_{m}\right)\right)$ with $\phi_{1}=\phi$ and $\phi_{i}=\text{Identity}$ for $2 \leq i \leq m$, we take
$$
A_{1}=\{(\phi(a_{1}), a_{2}, \ldots, a_{m}): a \in A\}
$$
and show that $R(A_{1}) \leq R(A)$.
Observe that
$$
\begin{align*}
mR (A_{1} )&=\mathbb{E}_{\sigma} \left[\sup _{a\in A_{1}}\sum_{i=1}^{m} \sigma_{i} a_{i} \right] \\
&=\mathbb{E}_{\sigma} \left[\sup _{a\in A}\left(\sigma_1\phi(a_1)+\sum_{i=2}^{m} \sigma_{i} a_{i} \right)\right] \\
& =\frac{1}{2} \mathbb{E} _{\sigma_{2}, \dots,\sigma_{m}}\left[\sup _{a\in A}\left(\phi (a_{1} )+\sum_{i=2}^{m} \sigma_{i} a_{i} \right) +\sup _{a^{\prime} \in A} \left(-\phi (a^{\prime}_{1} )+\sum_{i=2}^{m} \sigma_{i} a^{\prime}_{i} \right) \right] && \sigma_1 \text{ is } \pm 1 \text{ with probability } \tfrac{1}{2} \text{ each} \\
& =\frac{1}{2} \mathbb{E} _{\sigma_{2}, \dots,\sigma_{m}}\left[\sup _{a,a^{\prime}\in A}\left(\phi(a_1)-\phi(a^{\prime}_1)+\sum_{i=2}^{m} \sigma_{i} a_{i}+\sum_{i=2}^{m} \sigma_{i} a^{\prime}_{i} \right) \right]
\end{align*}
$$
Since $\phi$ is $1$-Lipschitz, we have $\phi (a_{1})-\phi (a_{1}^{\prime})\leq |a_{1}-a_{1}^{\prime}|$; moreover, the remaining terms are symmetric in $a$ and $a^{\prime}$, so we may assume without loss of generality that $a_{1} \geq a_{1}^{\prime}$, giving $\phi (a_{1})-\phi (a_{1}^{\prime})\leq a_{1}-a_{1}^{\prime}$. So, we get
$$
\begin{align*}
mR (A_{1} )&\leq \frac{1}{2} \mathbb{E}_{\sigma_{2}, \dots,\sigma_{m}} \left[\sup _{a, a^{\prime} \in A} \left(a_{1}-a_{1}^{\prime}+\sum_{i=2}^{m} \sigma_{i} a_{i} +\sum_{i=2}^{m} \sigma_{i} a_{i}^{\prime}\right)\right]\\
&=\frac{1}{2}\mathbb{E}_{\sigma_{2},\dots,\sigma_{m}} \left[\sup _{a \in A} \left(a_{1}+\sum_{i=2}^{m} \sigma_{i} a_{i}\right)+\sup _{a^{\prime} \in A} \left(-a_{1}^{\prime}+\sum_{i=2}^{m} \sigma_{i} a_{i}^{\prime}\right)\right]\\
&=\mathbb{E}_{\sigma} \left[\sup _{a \in A} \sum_{i=1}^{m} \sigma_{i} a_{i}\right]\\
&=mR(A).
\end{align*}
$$
Repeating the same argument for each coordinate $i=2, \ldots, m$ in turn yields $R(\phi(A)) \leq R(A)$, which completes the proof.