CS 7545: Machine Learning Theory -- Spring 2023
Instructor: Tyler Labonte
Scribes: Yitong Li, Sachit Kuhar
Let $F=\{f: X \rightarrow \mathbb{R}\}$ be a function class and let $S=\{(x_i, y_i)\}_{i=1}^{m} \sim D^m$ be a sample. We assume the $\sigma_i$ are independent Rademacher random variables, taking values in $\{+1,-1\}$ with equal probability.
Recall from Lecture 08 that the empirical Rademacher complexity is
$$\hat{\mathbb{R}}_S(F)=\underset{\sigma}{\mathbb{E}}\left[\sup _{f \in F} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i)\right]$$
Definition. For a set of vectors $A \subseteq \mathbb{R}^m$ with elements $a=\left(a_1, a_2, \ldots, a_m\right)$, the Rademacher complexity of $A$ is:
$$R(A)=\underset{\sigma}{E}\left[\sup _{a \in A} \frac{1}{m} \sum_{i=1}^m \sigma_i a_i\right]$$
Setting $A=\{(f(x_1), \ldots, f(x_m)): f \in F\}$ recovers $\hat{\mathbb{R}}_S(F)=R(A)$.
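As a quick sanity check, $R(A)$ for a finite set $A$ can be estimated by Monte Carlo over draws of $\sigma$. This is an illustrative sketch (the estimator and the example set below are my own choices, not from the lecture):

```python
import numpy as np

def empirical_rademacher(A, n_trials=200_000, seed=0):
    """Monte Carlo estimate of R(A) = E_sigma[ sup_{a in A} (1/m) sum_i sigma_i a_i ].

    A: array of shape (k, m) -- k vectors in R^m,
       e.g. rows (f(x_1), ..., f(x_m)) for f in a finite class F.
    """
    A = np.asarray(A, dtype=float)
    k, m = A.shape
    rng = np.random.default_rng(seed)
    # n_trials independent draws of m Rademacher signs in {-1, +1}
    sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
    # for each draw, take sup over a in A of (1/m) <sigma, a>, then average
    sups = (sigma @ A.T).max(axis=1) / m
    return sups.mean()

# Example: A = {(1, 1), (1, -1)} in R^2. Here sup_a <sigma, a> =
# sigma_1 + |sigma_2| = sigma_1 + 1, so the exact value is 1/2.
est = empirical_rademacher([[1.0, 1.0], [1.0, -1.0]])
```

Averaging the per-draw suprema is exactly the outer expectation in the definition, so the estimate converges to $R(A)$ as the number of trials grows.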
Why Rademacher complexity?
1. It enjoys nice properties and lemmas (a ``Rademacher calculus'').
2. It yields a nearly tight bound on the generalization gap
$\sup _{f \in F}\left|L(f)-\hat{L}_S(f)\right|$.
Properties of Rademacher complexity:
1. $R(cA)=|c| R(A)$, where $cA=\{c a : a \in A\}$ for $c \in \mathbb{R}$.
2. $A^{\prime} \subseteq A \Rightarrow R(A^{\prime}) \leq R(A)$.
3. $R(A+A^{\prime})=R(A)+R(A^{\prime})$, where $A+A^{\prime}=\{a+a^{\prime}: a \in A, a^{\prime} \in A^{\prime}\}$ is the Minkowski sum.
4. $R(\operatorname{Conv}(A))=R(A)$, where the convex hull is $\operatorname{Conv}(A)=\left\{\sum_i w_i a_i : a_i \in A,\ \sum_i w_i=1,\ w_i \geq 0\right\}$.
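A minimal numerical sanity check of Properties 1 and 2, assuming a Monte Carlo estimator like the one above (the random set $A$ and the tolerance are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
# shared Rademacher draws, reused for every set so the comparisons are fair
sigma = rng.choice([-1.0, 1.0], size=(100_000, m))

def R(B):
    # Monte Carlo estimate of E_sigma[ sup_{b in B} (1/m) <sigma, b> ]
    return ((sigma @ np.asarray(B, float).T).max(axis=1) / m).mean()

A = rng.normal(size=(5, m))                   # five illustrative vectors in R^4
c = -3.0
ok1 = abs(R(c * A) - abs(c) * R(A)) < 0.05    # Property 1: R(cA) = |c| R(A)
ok2 = R(A[:2]) <= R(A)                        # Property 2: A' subset of A => R(A') <= R(A)
```

Property 2 holds exactly here because, for shared $\sigma$ draws, the supremum over a subset of rows is pointwise at most the supremum over all rows; Property 1 holds up to Monte Carlo error.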
Massart's Lemma: If $|A|<\infty$, then
$$R(A)\leq\max _{a \in A}\|a\|_2\frac{\sqrt{2 \log |A|}}{m}.$$
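A sketch verifying Massart's bound on a random finite set (the choices of $k=|A|$, $m$, and $A$ below are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 20, 10
A = rng.normal(size=(k, m))                       # |A| = 20 vectors in R^10

# Monte Carlo estimate of R(A)
sigma = rng.choice([-1.0, 1.0], size=(200_000, m))
R_est = ((sigma @ A.T).max(axis=1) / m).mean()

# Massart's bound: max_a ||a||_2 * sqrt(2 log|A|) / m
bound = np.linalg.norm(A, axis=1).max() * np.sqrt(2 * np.log(k)) / m
```

On typical random instances the estimate sits well below the bound, consistent with the bound being driven by the single largest-norm element of $A$.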
Talagrand's Contraction Lemma: If each $\phi_i: \mathbb{R} \rightarrow \mathbb{R}$ is $L$-Lipschitz, then
$$R(\phi(A)) \leq L \cdot R(A),$$
where $\phi(A)=\{\phi(a): a \in A\}$ and
$\phi(a)=(\phi_1(a_1), \phi_2(a_2), \ldots, \phi_m(a_m))$.
Here $\phi_i$ being $L$-Lipschitz on $\mathbb{R}$ means
$|\phi_i(x)-\phi_i(y)| \leq L|x-y|$ for all $x, y \in \mathbb{R}$.
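A sketch checking the contraction lemma numerically with $\phi_i = \tanh$ for every $i$ ($\tanh$ is $1$-Lipschitz since $|\tanh'(x)| \leq 1$; the random set $A$ is an illustrative choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 6
A = rng.normal(size=(8, m))                        # eight illustrative vectors in R^6
sigma = rng.choice([-1.0, 1.0], size=(200_000, m)) # shared draws for both sets

def R(B):
    # Monte Carlo estimate of E_sigma[ sup_{b in B} (1/m) <sigma, b> ]
    return ((sigma @ B.T).max(axis=1) / m).mean()

# L = 1 case of the lemma: R(phi(A)) <= R(A), with phi applied coordinatewise
contracted = R(np.tanh(A))
original = R(A)
```

Since $\tanh$ shrinks every coordinate toward zero, the gap between the two estimates is typically much larger than the Monte Carlo noise.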
Proof of Massart's Lemma. Fix $\lambda>0$, to be chosen later, and study $A^{\prime}=\{\lambda a : a \in A\}$. Then
$$\begin{align*}
m R(A^{\prime})&=\mathbb{E}_\sigma\left[\max _{a^{\prime} \in A^{\prime}} \sum_{i=1}^m \sigma_i a_i^{\prime}\right] \\
&=\mathbb{E}_\sigma\left[\log \left(\max _{a^{\prime} \in A^{\prime}} \exp \left(\sum_{i=1}^m \sigma_i a_i^{\prime}\right)\right)\right] \\
&\leq \mathbb{E}_\sigma\left[\log \left(\sum_{a^{\prime} \in A^{\prime}} \exp \left(\sum_{i=1}^m \sigma_i a_i^{\prime}\right)\right)\right] && \text{max of positive terms} \leq \text{their sum}
\end{align*}$$
By Jensen's inequality applied to the concave function $\log$, we have:
$$\begin{align*}
& \leq \log \left(\mathbb{E}_\sigma\left[\sum_{a^{\prime} \in A^{\prime}} \exp \left(\sum_{i=1}^m \sigma_i a_i^{\prime}\right)\right]\right) \\
& =\log \left( \mathbb{E}_\sigma\left[\sum_{a^{\prime} \in A^{\prime}} \prod_{i=1}^m \exp (\sigma_i a_i^{\prime})\right]\right) \\
& =\log \left( \sum_{a^{\prime} \in A^{\prime}} \prod_{i=1}^m \mathbb{E}_{\sigma_i}\left[\exp (\sigma_i a_i^{\prime})\right]\right) && \text{by independence of the } \sigma_i
\end{align*}$$
We know that
$$\mathbb{E}_{\sigma_i}\left[\exp \left(\sigma_i a_i^{\prime}\right)\right]=\frac{1}{2} \exp \left(a_i^{\prime}\right)+\frac{1}{2} \exp \left(-a_i^{\prime}\right) \leq \exp \left(\frac{a_i^{\prime 2}}{2}\right),$$
where the inequality is $\cosh x \leq e^{x^2 / 2}$ (compare the Taylor series term by term).
Hence,
$$\begin{align*}
m R\left(A^{\prime}\right)&\leq\log \left(\sum_{a^{\prime} \in A^{\prime}} \prod_{i=1}^m \exp \left(\frac{a_i^{\prime 2}}{2}\right)\right) \\
& =\log \left(\sum_{a^{\prime} \in A^{\prime}} \exp \left(\frac{\left\|a^{\prime}\right\|_2^2}{2}\right)\right) \\
& \leq \log \left(\left|A^{\prime}\right| \max _{a^{\prime} \in A^{\prime}} \exp \left(\frac{\left\|a^{\prime}\right\|_2^2}{2}\right)\right) \\
& =\log \left|A^{\prime}\right|+\log \left(\max _{a^{\prime} \in A^{\prime}} \exp \left(\frac{\left\|a^{\prime}\right\|_2^2}{2}\right)\right) \\
& =\log \left|A^{\prime}\right|+\max _{a^{\prime} \in A^{\prime}} \frac{\left\|a^{\prime}\right\|_2^2}{2}.
\end{align*}$$
By Property 1, we have $R(A) =\frac{1}{\lambda} R\left(A^{\prime}\right)$. Substituting $a^{\prime}=\lambda a$ (so that $\left|A^{\prime}\right|=|A|$ and $\left\|a^{\prime}\right\|_2^2=\lambda^2\|a\|_2^2$), we obtain
$$R(A) \leq \frac{\log |A|+\lambda^2 \max_{a \in A} \frac{\|a\|_2^2}{2}}{\lambda m}.$$
Picking $\lambda=\sqrt{\frac{2 \log |A|}{\max_{a \in A}\|a\|_2^2}}$, which minimizes the right-hand side, gives
$$R(A) \leq \max_{a \in A}\|a\|_2 \frac{\sqrt{2 \log |A|}}{m}.$$
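The choice of $\lambda$ can be double-checked numerically. This sketch, with illustrative values $|A|=20$, $M=\max_{a \in A}\|a\|_2=4$, $m=10$ (my own choices, not from the lecture), confirms that $\lambda^{\ast}=\sqrt{2\log|A|}/M$ minimizes the bound and attains the stated value:

```python
import numpy as np

# the bound (log|A| + lambda^2 M^2 / 2) / (lambda m) is convex for lambda > 0:
# it has the form a/lambda + b*lambda, minimized at lambda = sqrt(a/b)
k, M, m = 20, 4.0, 10
f = lambda lam: (np.log(k) + lam ** 2 * M ** 2 / 2) / (lam * m)

lam_star = np.sqrt(2 * np.log(k)) / M      # claimed minimizer
grid = np.linspace(0.01, 5.0, 2000)        # brute-force comparison grid

no_better = f(lam_star) <= f(grid).min() + 1e-12
matches = np.isclose(f(lam_star), M * np.sqrt(2 * np.log(k)) / m)
```

Writing the bound as $\frac{\log|A|}{\lambda m} + \frac{\lambda M^2}{2m}$ makes the convexity, and hence the closed-form minimizer, immediate.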
Proof of Talagrand's Contraction Lemma
It is sufficient to prove for $L=1$ . Define $\phi^{\prime}=\frac{1}{L} \phi$ . Then, we have $R(\phi(A)) =R(L \phi^{\prime}(A))$ . Use Property 1, it follows that
$$
\begin{align*}
R(\phi(A)) &=R(L \phi^{\prime}(A))\\
&=L R(\phi^{\prime}(A)) .
\end{align*}
$$
It is also sufficient to show the claim for $A_{1}$, the set in which $\phi$ is applied only to the first coordinate; the general case then follows by repeating the argument one coordinate at a time. That is, writing $\phi(a)=\left(\phi_{1}\left(a_{1}\right), \phi_{2}\left(a_{2}\right), \ldots, \phi_{m}\left(a_{m}\right)\right)$ with $\phi_{1}=\phi$ and $\phi_{i}=\text{Identity}$ for $2 \leq i \leq m$, we take
$$
A_{1}=\{(\phi(a_{1}), a_{2}, \ldots, a_{m}): a \in A\}
$$
and show that $R(A_{1}) \leq R(A)$.
Observe that
$$
\begin{align*}
mR (A_{1} )&=\mathbb{E}_{\sigma} \left[\sup _{a\in A_{1}}\sum_{i=1}^{m} \sigma_{i} a_{i} \right] \\
&=\mathbb{E}_{\sigma} \left[\sup _{a\in A}\left(\sigma_1\phi(a_1)+\sum_{i=2}^{m} \sigma_{i} a_{i} \right)\right] \\
& =\frac{1}{2} \mathbb{E} _{\sigma_{2}, \dots,\sigma_{m}}\left[\sup _{a\in A}\left(\phi (a_{1} )+\sum_{i=2}^{m} \sigma_{i} a_{i} \right) +\sup _{a^{\prime} \in A} \left(-\phi (a^{\prime}_{1} )+\sum_{i=2}^{m} \sigma_{i} a^{\prime}_{i} \right) \right] && \sigma_1 \text{ is } \pm 1 \text{ with probability } \tfrac{1}{2} \text{ each} \\
& =\frac{1}{2} \mathbb{E} _{\sigma_{2}, \dots,\sigma_{m}}\left[\sup _{a,a^{\prime}\in A}\left(\phi(a_1)-\phi(a^{\prime}_1)+\sum_{i=2}^{m} \sigma_{i} a_{i}+\sum_{i=2}^{m} \sigma_{i} a^{\prime}_{i} \right) \right]
\end{align*}
$$
Since $\phi$ is $1$-Lipschitz, we have $\phi (a_{1})-\phi (a_{1}^{\prime})\leq |a_{1}-a_{1}^{\prime}|$; moreover, the remaining terms are symmetric in $a$ and $a^{\prime}$, so we may assume without loss of generality that $a_{1} \geq a_{1}^{\prime}$, giving $\phi (a_{1})-\phi (a_{1}^{\prime})\leq a_{1}-a_{1}^{\prime}$. So, we get
$$
\begin{align*}
mR (A_{1} )&\leq \frac{1}{2} \mathbb{E}_{\sigma_{2}, \dots,\sigma_{m}} \left[\sup _{a, a^{\prime} \in A} \left(a_{1}-a_{1}^{\prime}+\sum_{i=2}^{m} \sigma_{i} a_{i} +\sum_{i=2}^{m} \sigma_{i} a_{i}^{\prime}\right)\right]\\
&=\frac{1}{2}\mathbb{E}_{\sigma_{2},\dots,\sigma_{m}} \left[\sup _{a \in A} \left(a_{1}+\sum_{i=2}^{m} \sigma_{i} a_{i}\right)+\sup _{a^{\prime} \in A} \left(-a_{1}^{\prime}+\sum_{i=2}^{m} \sigma_{i} a_{i}^{\prime}\right)\right]\\
&=\mathbb{E}_{\sigma} \left[\sup _{a \in A} \sum_{i=1}^{m} \sigma_{i} a_{i}\right]\\
&=mR(A).
\end{align*}
$$
Repeating the same argument for each coordinate $i=2, \ldots, m$ in turn yields $R(\phi(A)) \leq R(A)$, which completes the proof.