Practice Exam 2

  1. Which of the following statements is false? Please choose only one!
    1. If a hypothesis class has finite VC-dimension $d$, then its generalization guarantee when using an input space of dimension $n$ depends on the ratio $d/n$.
    2. If a hypothesis class has finite VC-dimension, then it must have finite Rademacher complexity.
    3. There exists a hypothesis class with infinite VC-dimension but finite Rademacher complexity.
    4. If a hypothesis class $\mathcal{H}$ has $|\mathcal{H}|<\infty$, then its Rademacher complexity is upper bounded by a constant times $\sqrt{\log |\mathcal{H}|}$ (ignoring possible other factors that do not include $|\mathcal{H}|$).

Solution. The correct answer is (1). The generalization guarantee should instead depend on the ratio $d/m$, where $m$ is the number of training samples, not on the input dimension $n$. (2) is true because the Rademacher complexity of a class of functions into $\{-1,1\}$ is always finite (it is at most $1$), regardless of the VC-dimension. (3) is true for the same reason: the class of all functions from $\mathbb{R}$ to $\{-1,1\}$, for instance, has infinite VC-dimension but finite Rademacher complexity. (4) is true by Massart's finite class lemma.
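
For reference, one standard form of Massart's finite class lemma (the constant may differ from the version stated in lecture) says that for a finite class $\mathcal{H}$ of functions into $\{-1,1\}$ and a sample of size $m$, $$\mathfrak{R}(\mathcal{H})\leq\sqrt{\frac{2\log|\mathcal{H}|}{m}},$$ which is a constant times $\sqrt{\log|\mathcal{H}|}$ once factors not involving $|\mathcal{H}|$ are ignored.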

  1. Which of the following statements is true? Please choose only one!
    1. The ERM hypothesis attains zero error on the training set.
    2. The estimation error of the ERM hypothesis $\hat{h}$ is upper bounded by a constant times a property of the hypothesis class $\mathcal{H}$ which does not involve $\hat{h}$.
    3. To minimize the generalization error, one should choose a hypothesis class $\mathcal{H}$ containing a hypothesis that attains zero error on the training set.
    4. While the estimation error of ERM is difficult to characterize, we were able to prove bounds on its approximation error.

Solution. The correct answer is (2). It references the bound $L(\hat{h})-L(h^\star)\leq 2\sup_{h\in\mathcal{H}}|L(h)-\widehat{L}_S(h)|$ for an ERM hypothesis $\hat{h}\in\mathcal{H}$ and $h^\star= \arg\min_{h\in\mathcal{H}}L(h)$; the right-hand side is a property of $\mathcal{H}$ (and the sample) that does not involve $\hat{h}$. (1) is false because the ERM hypothesis minimizes the training error but need not drive it to zero. (3) is false by the bias-complexity tradeoff: a class rich enough to fit the training set perfectly may overfit and have large estimation error. (4) is false since we proved bounds on the estimation error, not the approximation error.
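
The bound referenced in (2) follows from the standard decomposition, using $\widehat{L}_S(\hat{h})\leq \widehat{L}_S(h^\star)$ since $\hat{h}$ minimizes the empirical risk: $$L(\hat{h})-L(h^\star)=\left(L(\hat{h})-\widehat{L}_S(\hat{h})\right)+\left(\widehat{L}_S(\hat{h})-\widehat{L}_S(h^\star)\right)+\left(\widehat{L}_S(h^\star)-L(h^\star)\right)\leq 2\sup_{h\in\mathcal{H}}\left|L(h)-\widehat{L}_S(h)\right|.$$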

  1. Let $A\subseteq \mathbb{R}^m$ and $B\subseteq \mathbb{R}^m$. Define the Minkowski sum of $A$ and $B$ by $A+B=\{a+b: a\in A, b\in B\}$. Prove that $\mathfrak{R}(A+B)=\mathfrak{R}(A)+\mathfrak{R}(B)$.

Solution. We have $$\mathfrak{R}(A+B)=\mathbb{E}_\sigma \left[\sup_{a\in A,b\in B} \frac{1}{m}\sum_{i=1}^m \sigma_i (a_i+b_i)\right]=\mathbb{E}_\sigma \left[\sup_{a\in A,b\in B}\left( \frac{1}{m}\sum_{i=1}^m \sigma_i a_i+\frac{1}{m}\sum_{i=1}^m \sigma_i b_i\right)\right]$$ $$=\mathbb{E}_\sigma \left[\sup_{a\in A}\frac{1}{m}\sum_{i=1}^m \sigma_i a_i+\sup_{b\in B}\frac{1}{m}\sum_{i=1}^m \sigma_i b_i\right]=\mathbb{E}_\sigma \left[\sup_{a\in A} \frac{1}{m}\sum_{i=1}^m \sigma_i a_i\right]+\mathbb{E}_\sigma\left[\sup_{b\in B} \frac{1}{m}\sum_{i=1}^m \sigma_i b_i\right]=\mathfrak{R}(A)+\mathfrak{R}(B).$$ The third equality holds because the two terms inside the supremum depend on $a$ and $b$ separately, so $\sup_{a\in A,b\in B}\left(f(a)+g(b)\right)=\sup_{a\in A}f(a)+\sup_{b\in B}g(b)$; the fourth is linearity of expectation.
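
This identity is also easy to check numerically. The sketch below (a sanity check, not part of the proof; the helper names are ours, and NumPy is assumed to be available) computes the Rademacher complexity of small finite sets exactly by enumerating all $2^m$ sign vectors, and confirms that the values for $A$, $B$, and $A+B$ add up.

```python
import itertools
import numpy as np

def rad(vectors, m):
    """Exact E_sigma[ sup_{v in set} (1/m) * <sigma, v> ], enumerating all 2^m sign vectors."""
    vs = np.asarray(vectors, dtype=float)              # shape (|set|, m)
    sups = [np.max(vs @ np.array(sigma)) / m
            for sigma in itertools.product([-1.0, 1.0], repeat=m)]
    return float(np.mean(sups))

rng = np.random.default_rng(0)
m = 4
A = rng.uniform(-1, 1, size=(5, m))                    # small finite set A in R^m
B = rng.uniform(-1, 1, size=(7, m))                    # small finite set B in R^m
AB = np.array([a + b for a in A for b in B])           # Minkowski sum A + B

print(rad(AB, m), rad(A, m) + rad(B, m))               # equal up to floating-point error
```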

  1. What is the VC-dimension of axis-aligned squares in $\mathbb{R}^2$? An axis-aligned square is a function $h_{x, r}:\mathbb{R}^2\to \mathbb{R}$ parameterized by a lower-left corner $x=(x_1,x_2)\in \mathbb{R}^2$ and side length $r\in \mathbb{R}$, where $h_{x,r}(z)=+1$ when $z\in [x_1, x_1+r]\times [x_2, x_2+r]$ and $h_{x,r}(z)=-1$ otherwise.

Solution. We have $\text{VC}(\mathcal{H})=3$. We can shatter the set $S=\{p_1,p_2,p_3\}$ with $p_1=(0,1)$, $p_2=(1,0)$, $p_3=(-1,0)$ as follows. If no point is labeled $+1$, choose the square $x=(2,2),r=1$. If exactly one point $p_i$ is labeled $+1$, choose the square $x=p_i,r=1/2$. If $p_1,p_2$ are labeled $+1$, choose $x=(0,0),r=1$. If $p_1,p_3$ are labeled $+1$, choose $x=(-1,0),r=1$. If $p_2,p_3$ are labeled $+1$, choose $x=(-1,-2),r=2$. If all three points are labeled $+1$, choose $x=(-1,0),r=2$.

Consider any set $S$ with $|S|=4$. Let $p_{x\min},p_{x\max},p_{y\min},p_{y\max}$ denote points of $S$ with the smallest/largest $x$-coordinate and smallest/largest $y$-coordinate, and let $x_{\min},x_{\max},y_{\min},y_{\max}$ be the corresponding coordinate values; assume these are four distinct points with no ties among coordinates, since the cases with ties are simpler. Assume without loss of generality that $y_{\max}-y_{\min}\geq x_{\max}-x_{\min}$. We claim the labeling $p_{x\min}\mapsto -1$, $p_{x\max}\mapsto -1$, $p_{y\min}\mapsto +1$, $p_{y\max}\mapsto +1$ cannot be realized. Any square containing $p_{y\min}$ and $p_{y\max}$ has side length $r\geq y_{\max}-y_{\min}\geq x_{\max}-x_{\min}$, so its vertical extent contains $[y_{\min},y_{\max}]$ and hence the $y$-coordinate of every point of $S$. Its horizontal extent is an interval of length $r$ containing the $x$-coordinates of $p_{y\min}$ and $p_{y\max}$, which lie in $[x_{\min},x_{\max}]$; if it contained neither $x_{\min}$ nor $x_{\max}$, it would lie strictly inside $(x_{\min},x_{\max})$, forcing $r<x_{\max}-x_{\min}$, a contradiction. Hence the square also contains $p_{x\min}$ or $p_{x\max}$, so no set of four points can be shattered.
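
The shattering argument can also be checked mechanically. The following is a small illustrative script (our own sanity check, not part of the solution) verifying that the eight squares above realize all $2^3$ labelings of $S$.

```python
def label(p, corner, r):
    """Label of point p under the axis-aligned square with lower-left corner `corner` and side r."""
    return 1 if (corner[0] <= p[0] <= corner[0] + r and corner[1] <= p[1] <= corner[1] + r) else -1

S = [(0, 1), (1, 0), (-1, 0)]                  # the set claimed to be shattered

# (lower-left corner, side length) -> claimed labeling of (p1, p2, p3)
squares = {
    ((2, 2), 1):    (-1, -1, -1),
    ((0, 1), 0.5):  (+1, -1, -1),
    ((1, 0), 0.5):  (-1, +1, -1),
    ((-1, 0), 0.5): (-1, -1, +1),
    ((0, 0), 1):    (+1, +1, -1),
    ((-1, 0), 1):   (+1, -1, +1),
    ((-1, -2), 2):  (-1, +1, +1),
    ((-1, 0), 2):   (+1, +1, +1),
}

realized = set()
for (corner, r), claimed in squares.items():
    got = tuple(label(p, corner, r) for p in S)
    assert got == claimed, (corner, r, got)
    realized.add(got)

assert len(realized) == 8                      # all 2^3 labelings appear, so S is shattered
print("S is shattered by axis-aligned squares")
```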

  1. Prove that the Sauer-Shelah lemma is tight in $\mathbb{R}$, i.e., exhibit a hypothesis class $\mathcal{H}\subseteq \{h:\mathbb{R}\to\{-1,1\}\}$ and $d\in\mathbb{N}$ such that $\text{VC}(\mathcal{H})=d$ and $\Pi_\mathcal{H}(m)=\sum\limits_{i=0}^d \binom{m}{i}$ for all $m$. Hint. You can take $d=1$.

Solution. Consider the hypothesis class $\mathcal{H}=\{h_x : x\in\mathbb{R}\}$, where $h_x$ labels $x$ as $+1$ and everything else as $-1$. We have $\text{VC}(\mathcal{H})=1$: a single point $z$ is shattered (use $h_z$ for the label $+1$ and any $h_x$ with $x\neq z$ for the label $-1$), but no two distinct points can be shattered, since no hypothesis labels two points $+1$. Now choose any set $S=\{x_1,\dots,x_m\}$ of $m$ distinct points. The hypotheses in $\mathcal{H}$ realize exactly $m+1$ labelings of $S$: the $m$ labelings in which a single $x_i$ is $+1$, plus the all-$(-1)$ labeling (take $h_x$ with $x\notin S$). Therefore, $\Pi_\mathcal{H}(m)=m+1=\binom{m}{0}+\binom{m}{1}=\sum\limits_{i=0}^{1}\binom{m}{i}$.
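
As a quick illustration (not needed for the proof), enumerating the dichotomies of this singleton class on $m$ sample points confirms the count $m+1$; the snippet below is a minimal sketch.

```python
m = 5
S = list(range(m))                                   # any m distinct real numbers
centers = S + [m]                                    # h_x for each x in S, plus one h_x with x outside S
dichotomies = {tuple(+1 if s == x else -1 for s in S) for x in centers}
print(len(dichotomies))                              # prints m + 1 = 6
```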

  1. Let $\mathcal{F}\subseteq \{f:\mathcal{X}\to [-1,1]\}$ and define a loss function $\ell:\mathbb{R}\to\mathbb{R}$ by $\ell(z)=\frac{1}{2}|z|$. Set $L(f)=\mathbb{E}_{(x,y)\sim \mathcal{D}} [\ell(y-f(x))]$ and $\widehat{L}_S(f)=\frac{1}{m}\sum\limits_{i=1}^m \ell(y_i-f(x_i))$ for a set $S=\{(x_i,y_i)\}_{i=1}^m$ where $y\in\{-1,1\}$. Prove that with probability at least $1-\delta$ over $S\sim \mathcal{D}^m$, for all $f\in\mathcal{F}$, $$L(f)\leq \widehat{L}_S(f)+\mathfrak{R}(\mathcal{F})+\sqrt{\frac{\log 1/\delta}{2m}}.$$ Hint. First, define a class for functions $(x,y)\mapsto y-f(x)$ and compute its Rademacher complexity. Then, use the property that $\ell(\cdot)$ is a $\frac{1}{2}$-Lipschitz function.

Solution. Define $\mathcal{L}=\{(x,y)\mapsto \frac{1}{2}|y-f(x)|:f\in\mathcal{F}\}$ and $\mathcal{G}=\{(x,y)\mapsto y-f(x):f\in\mathcal{F}\}$. We have $$\mathfrak{R}(\mathcal{G})=\mathbb{E}_{S,\sigma}\left[\sup_{g\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^m \sigma_i g(x_i,y_i)\right]=\mathbb{E}_{S,\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i (y_i-f(x_i))\right]$$ $$=\mathbb{E}_{S,\sigma}\left[\frac{1}{m}\sum_{i=1}^m \sigma_i y_i+\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m (-\sigma_i) f(x_i)\right]=\mathbb{E}_{S,\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]=\mathfrak{R}(\mathcal{F}),$$ where the term $\frac{1}{m}\sum_{i=1}^m \sigma_i y_i$ does not depend on $f$ and vanishes in expectation since $\mathbb{E}[\sigma_i]=0$ and $\sigma_i$ is independent of $y_i$, and the last equality uses that $(-\sigma_1,\dots,-\sigma_m)$ has the same distribution as $(\sigma_1,\dots,\sigma_m)$.

Note that $\mathcal{L}=\ell\circ \mathcal{G}$ and $\ell$ is $\frac{1}{2}$-Lipschitz, so by Talagrand's contraction lemma, $\mathfrak{R}(\mathcal{L})\leq \frac{1}{2}\mathfrak{R}(\mathcal{G})=\frac{1}{2}\mathfrak{R}(\mathcal{F})$. The result follows by applying the Rademacher complexity generalization bound to $\mathcal{L}$, as spelled out below; this is valid because $y\in\{-1,1\}$ and $\mathcal{F}$ maps into $[-1,1]$, so $\mathcal{L}$ maps into $[0,1]$ as needed.
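
Explicitly, the generalization bound for the $[0,1]$-valued class $\mathcal{L}$ (in the form stated with the expected Rademacher complexity; constants may differ slightly depending on the version used in lecture) gives that with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, for all $f\in\mathcal{F}$, $$L(f)\leq \widehat{L}_S(f)+2\mathfrak{R}(\mathcal{L})+\sqrt{\frac{\log 1/\delta}{2m}}\leq \widehat{L}_S(f)+\mathfrak{R}(\mathcal{F})+\sqrt{\frac{\log 1/\delta}{2m}}.$$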