Practice Exam 2
- Which of the following statements is false? Please choose only one!
- (a) If a hypothesis class has finite VC-dimension $d$, then its generalization guarantee when using an input space of dimension $n$ depends on the ratio $d/n$.
- (b) If a hypothesis class has finite VC-dimension, then it must have finite Rademacher complexity.
- (c) There exists a hypothesis class with infinite VC-dimension but finite Rademacher complexity.
- (d) If a hypothesis class $\mathcal{H}$ has $|\mathcal{H}|<\infty$, then its Rademacher complexity is upper bounded by a constant times $\sqrt{\log |\mathcal{H}|}$ (ignoring possible other factors that do not include $|\mathcal{H}|$).
Solution. The correct answer is (a): the guarantee should instead depend on $d/m$, where $m$ is the number of training samples, not on the input dimension $n$. (b) and (c) are true for the same reason: the Rademacher complexity of any class of functions bounded in $[-1,1]$ is at most $1$, hence finite, regardless of VC-dimension (for (c), take e.g. the class of all $\{-1,1\}$-valued functions). (d) is true by Massart's finite class lemma.
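To illustrate (d) numerically, here is a minimal Monte Carlo sketch (not part of the exam; the class size, sample size, and trial count are arbitrary choices) that estimates the Rademacher complexity of a random finite class with predictions in $[-1,1]$ and compares it to the Massart-style bound $\sqrt{2\log|\mathcal{H}|/m}$.

```python
# Minimal sketch: Monte Carlo estimate of the empirical Rademacher complexity
# of a finite class H (rows = prediction vectors in [-1,1]^m), compared with
# the Massart-style bound sqrt(2 * log|H| / m).
import numpy as np

rng = np.random.default_rng(0)
m, n_hyp, trials = 200, 50, 2000

H = rng.uniform(-1, 1, size=(n_hyp, m))        # each row: one hypothesis' predictions
sigma = rng.choice([-1, 1], size=(trials, m))  # i.i.d. Rademacher sign vectors

# For each sign vector, take the sup over hypotheses of (1/m) <sigma, h>, then average.
estimate = np.mean(np.max(sigma @ H.T, axis=1) / m)
bound = np.sqrt(2 * np.log(n_hyp) / m)

print(f"estimate = {estimate:.4f}  <=  bound = {bound:.4f}")
```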
- Which of the following statements is true? Please choose only one!
- (a) The ERM hypothesis attains zero error on the training set.
- (b) The estimation error of the ERM hypothesis $\hat{h}$ is upper bounded by a constant times a property of the hypothesis class $\mathcal{H}$ which does not involve $\hat{h}$.
- (c) To minimize the generalization error, one should choose a hypothesis class $\mathcal{H}$ containing a hypothesis that attains zero error on the training set.
- (d) While the estimation error of ERM is difficult to characterize, we were able to prove bounds on its approximation error.
Solution. The correct answer is (b). It references the bound $L(\hat{h})-L(h^\star)\leq 2\sup_{h\in\mathcal{H}}|L(h)-\widehat{L}_S(h)|$ for an ERM hypothesis $\hat{h}\in\mathcal{H}$ and $h^\star= \arg\min_{h\in\mathcal{H}}L(h)$; the right-hand side depends only on $\mathcal{H}$, not on $\hat{h}$. (a) is false because the ERM hypothesis minimizes the training error but need not drive it to zero. (c) is false by the bias-complexity tradeoff: a class rich enough to fit the training data exactly may incur large estimation error. (d) is false since we proved bounds on the estimation error, not the approximation error.
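For reference, the bound cited in (b) follows from the standard three-term decomposition, using $\widehat{L}_S(\hat{h})\leq \widehat{L}_S(h^\star)$ (the definition of ERM) for the middle term: $$L(\hat{h})-L(h^\star)=\left[L(\hat{h})-\widehat{L}_S(\hat{h})\right]+\left[\widehat{L}_S(\hat{h})-\widehat{L}_S(h^\star)\right]+\left[\widehat{L}_S(h^\star)-L(h^\star)\right]\leq 2\sup_{h\in\mathcal{H}}|L(h)-\widehat{L}_S(h)|.$$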
- Let $A\subseteq \mathbb{R}^m$ and $B\subseteq \mathbb{R}^m$. Define the Minkowski sum of $A$ and $B$ by $A+B=\{a+b: a\in A, b\in B\}$. Prove that $\mathfrak{R}(A+B)=\mathfrak{R}(A)+\mathfrak{R}(B)$.
Solution. We have $$\mathfrak{R}(A+B)=\mathbb{E}_\sigma \left[\sup_{a\in A,b\in B} \frac{1}{m}\sum_{i=1}^m \sigma_i (a_i+b_i)\right]=\mathbb{E}_\sigma \left[\sup_{a\in A,b\in B}\left( \frac{1}{m}\sum_{i=1}^m \sigma_i a_i+\frac{1}{m}\sum_{i=1}^m \sigma_i b_i\right)\right]$$ $$=\mathbb{E}_\sigma \left[\sup_{a\in A}\frac{1}{m}\sum_{i=1}^m \sigma_i a_i+\sup_{b\in B}\frac{1}{m}\sum_{i=1}^m \sigma_i b_i\right]=\mathbb{E}_\sigma \left[\sup_{a\in A} \frac{1}{m}\sum_{i=1}^m \sigma_i a_i\right]+\mathbb{E}_\sigma\left[\sup_{b\in B} \frac{1}{m}\sum_{i=1}^m \sigma_i b_i\right]=\mathfrak{R}(A)+\mathfrak{R}(B),$$ where the supremum splits into two separate suprema because $a$ and $b$ range over $A$ and $B$ independently.
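As a quick sanity check (not required for the proof), the identity can be verified numerically for finite sets: for each fixed sign vector $\sigma$, the supremum over $A+B$ decouples exactly into the two separate suprema, so the two estimates below agree up to floating-point error. This is a minimal sketch; the set sizes and dimension are arbitrary choices.

```python
# Minimal sketch: verify R(A + B) = R(A) + R(B) for small random finite sets.
import numpy as np

rng = np.random.default_rng(1)
m, n_a, n_b, trials = 50, 8, 8, 5000

A = rng.normal(size=(n_a, m))
B = rng.normal(size=(n_b, m))
AB = (A[:, None, :] + B[None, :, :]).reshape(-1, m)  # Minkowski sum: all a + b

def rad_estimate(C, sigma):
    """Monte Carlo estimate of R(C) for a finite set C, given sign draws sigma."""
    return np.mean(np.max(sigma @ C.T, axis=1) / m)

sigma = rng.choice([-1, 1], size=(trials, m))  # reuse the same draws for all three sets
print(rad_estimate(AB, sigma), rad_estimate(A, sigma) + rad_estimate(B, sigma))
```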
- What is the VC-dimension of axis-aligned squares in $\mathbb{R}^2$? An axis-aligned square is a function $h_{x, r}:\mathbb{R}^2\to \mathbb{R}$ parameterized by a lower-left corner $x=(x_1,x_2)\in \mathbb{R}^2$ and side length $r\in \mathbb{R}$, where $h_{x,r}(z)=+1$ when $z\in [x_1, x_1+r]\times [x_2, x_2+r]$ and $h_{x,r}(z)=-1$ otherwise.
Solution. We have $\text{VC}(\mathcal{H})=3$. We can shatter the set $S=\{(0,1), (1,0), (-1,0)\}$ as follows. If no points are labeled $+1$, choose the square $x=(2, 2),r=1$. If exactly one point $s$ is labeled $+1$, choose the square $x=s,r=1/2$. If $(0,1)$ and $(1,0)$ are labeled $+1$, choose the square $x=(0,0),r=1$. If $(0,1)$ and $(-1,0)$ are labeled $+1$, choose the square $x=(-1,0),r=1$. If $(1,0)$ and $(-1,0)$ are labeled $+1$, choose the square $x=(-1,-2),r=2$. If all three points are labeled $+1$, choose the square $x=(-1,0),r=2$.
Now consider any set $S$ with $|S|=4$. Let $p_{x,\min},p_{x,\max},p_{y,\min},p_{y,\max}$ denote the points of $S$ with the smallest/largest $x$- and $y$-coordinates, and write $x_{\min},x_{\max},y_{\min},y_{\max}$ for those coordinates; assume the extremes are attained uniquely, since ties only make the argument simpler. Assume without loss of generality that $y_{\max}-y_{\min}\geq x_{\max}-x_{\min}$. Then we cannot realize the labeling $p_{x,\min}\mapsto -1$, $p_{x,\max}\mapsto -1$, $p_{y,\min}\mapsto +1$, $p_{y,\max}\mapsto +1$: any square containing $p_{y,\min}$ and $p_{y,\max}$ has side length $r\geq y_{\max}-y_{\min}\geq x_{\max}-x_{\min}$, so its vertical extent covers the $y$-coordinates of all four points, and excluding both $p_{x,\min}$ and $p_{x,\max}$ would then force $x_{\max}-x_{\min}>r$, a contradiction. Hence no set of four points can be shattered.
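The shattering argument above is easy to check by brute force; here is a minimal sketch (illustration only) confirming that the six square choices listed realize all eight labelings of $S$.

```python
# Minimal sketch: verify that the squares from the solution shatter
# S = {(0,1), (1,0), (-1,0)}.
from itertools import product

S = [(0, 1), (1, 0), (-1, 0)]

def h(x, r, z):
    """Axis-aligned square classifier: +1 inside [x1, x1+r] x [x2, x2+r]."""
    return 1 if (x[0] <= z[0] <= x[0] + r and x[1] <= z[1] <= x[1] + r) else -1

# (lower-left corner, side length) pairs from the solution.
squares = [((2, 2), 1), ((0, 1), 0.5), ((1, 0), 0.5), ((-1, 0), 0.5),
           ((0, 0), 1), ((-1, 0), 1), ((-1, -2), 2), ((-1, 0), 2)]

realized = {tuple(h(x, r, z) for z in S) for x, r in squares}
assert realized == set(product([-1, 1], repeat=3))  # all 2^3 labelings realized
print("S is shattered; VC >= 3.")
```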
- Prove that the Sauer-Shelah lemma is tight in $\mathbb{R}$, i.e., exhibit a hypothesis class $\mathcal{H}\subseteq \{h:\mathbb{R}\to\{-1,1\}\}$ and $d\in\mathbb{N}$ such that $\text{VC}(\mathcal{H})=d$ and $\Pi_\mathcal{H}(m)=\sum\limits_{i=0}^d \binom{m}{i}$. Hint. You can take $d=1$.
Solution. Consider the hypothesis class $\mathcal{H}=\{h_x : x\in\mathbb{R}\}$, where $h_x$ labels $x$ as $+1$ and everything else as $-1$. We have $\text{VC}(\mathcal{H})=1$: a single point can receive both labels, but no hypothesis labels two points $+1$ simultaneously. Now choose any set $S=\{x_1,\dots,x_m\}$ of $m$ distinct points. There are exactly $m+1$ ways to classify $S$ using hypotheses in $\mathcal{H}$: one labeling with a single $+1$ for each $x_i$, plus the all-$(-1)$ labeling, and this holds for any set of $m$ points. Therefore, $\Pi_\mathcal{H}(m)=m+1=\binom{m}{0}+\binom{m}{1}=\sum\limits_{i=0}^{1}\binom{m}{i}$.
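A short numerical check (illustration only, with an arbitrary choice of $m$): enumerating the behaviors of the singleton class on $m$ distinct points confirms $\Pi_\mathcal{H}(m)=m+1$.

```python
# Minimal sketch: count the labelings the singleton class {h_x} induces on m points.
m = 10
points = list(range(m))

labelings = set()
# Probing h_x at each sample point, plus one x outside the sample
# (which yields the all-(-1) labeling), covers every possible behavior.
for x in points + [m]:
    labelings.add(tuple(+1 if p == x else -1 for p in points))

print(len(labelings), "==", m + 1)  # both are 11
```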
- Let $\mathcal{F}\subseteq \{f:\mathcal{X}\to [-1,1]\}$ and define a loss function $\ell:\mathbb{R}\to\mathbb{R}$ by $\ell(z)=\frac{1}{2}|z|$. Set $L(f)=\mathbb{E}_{(x,y)\sim \mathcal{D}} [\ell(y-f(x))]$ and $\widehat{L}_S(f)=\frac{1}{m}\sum\limits_{i=1}^m \ell(y_i-f(x_i))$ for a set $S=\{(x_i,y_i)\}_{i=1}^m$ where $y_i\in\{-1,1\}$. Prove that with probability at least $1-\delta$ over $S\sim \mathcal{D}^m$, for all $f\in\mathcal{F}$, $$L(f)\leq \widehat{L}_S(f)+\mathfrak{R}(\mathcal{F})+\sqrt{\frac{\log 1/\delta}{2m}}.$$ Hint. First, define a class for functions $(x,y)\mapsto y-f(x)$ and compute its Rademacher complexity. Then, use the property that $\ell(\cdot)$ is a $\frac{1}{2}$-Lipschitz function.
Solution. Define $\mathcal{L}=\{(x,y)\mapsto \frac{1}{2}|y-f(x)|:f\in\mathcal{F}\}$ and $\mathcal{G}=\{(x,y)\mapsto y-f(x):f\in\mathcal{F}\}$. We have $$\mathfrak{R}(\mathcal{G})=\mathbb{E}_{S,\sigma}\left[\sup_{g\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^m \sigma_i g(x_i,y_i)\right]=\mathbb{E}_{S,\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i (y_i-f(x_i))\right]$$ $$=\mathbb{E}_{S,\sigma}\left[\frac{1}{m}\sum_{i=1}^m \sigma_i y_i+\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m (-\sigma_i) f(x_i)\right]=\mathbb{E}_{S,\sigma}\left[\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right]=\mathfrak{R}(\mathcal{F}),$$ where the last line uses that the first term does not depend on $f$ and $\mathbb{E}[\sigma_i y_i]=0$, and that $-\sigma_i$ has the same distribution as $\sigma_i$.
Note that $\mathcal{L}=\ell\circ \mathcal{G}$, so by Talagrand's lemma, $\mathfrak{R}(\mathcal{L})\leq \frac{1}{2}\mathfrak{R}(\mathcal{G})=\frac{1}{2}\mathfrak{R}(\mathcal{F})$. The result follows by applying the Rademacher complexity generalization bound to $\mathcal{L}$; we may do this because $\mathcal{F}$ maps to $[-1,1]$ and $y\in\{-1,1\}$, so $\mathcal{L}$ maps to $[0,1]$ as needed. The factor of $2$ in that bound combines with the $\frac{1}{2}$ from Talagrand's lemma, as spelled out below.
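Explicitly, with the version of the generalization bound carrying a factor of $2\mathfrak{R}(\mathcal{L})$ (the form used in lecture), we get with probability at least $1-\delta$, for all $f\in\mathcal{F}$: $$L(f)\leq \widehat{L}_S(f)+2\mathfrak{R}(\mathcal{L})+\sqrt{\frac{\log 1/\delta}{2m}}\leq \widehat{L}_S(f)+2\cdot\frac{1}{2}\,\mathfrak{R}(\mathcal{F})+\sqrt{\frac{\log 1/\delta}{2m}}=\widehat{L}_S(f)+\mathfrak{R}(\mathcal{F})+\sqrt{\frac{\log 1/\delta}{2m}}.$$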