CS7545_Sp24_Practice_Exam_1 - mltheory/CS7545 GitHub Wiki
CS 7545: Machine Learning Theory -- Spring 2024
Practice Problems for Exam 1
Norm Inequality in Unit Ball:
For $x \in \mathbb{R}^N$ and $p > 1$, show that if $\|x\|_p = 1$, then $\|x\|_\infty \leq 1$. Is the converse true? Justify your answer.
Solution:
If $\|x\|_p = 1$, then by definition:
$$\left(\sum_{i=1}^N |x_i|^p\right)^{1/p} = 1$$
This implies that for each $i$, $|x_i|^p \leq 1$, and hence $|x_i| \leq 1$. Therefore, $\|x\|_\infty \leq 1$. The converse is not true in general: consider the all-ones vector, which has $\infty$-norm equal to $1$ but $p$-norm equal to $N^{1/p} > 1$ whenever $N > 1$.
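The claim can be sanity-checked numerically; the following sketch (plain Python, with an arbitrarily chosen test vector and $p = 3$) normalizes a vector to unit $p$-norm and confirms both the claim and the counterexample to the converse.

```python
# Numeric sanity check (a sketch, not part of the exam solution): if
# ||x||_p = 1 then ||x||_inf <= 1, while the all-ones vector shows the
# converse fails for N > 1. Test vector and p are arbitrary choices.

def p_norm(x, p):
    """Compute the l_p norm of a vector given as a list of floats."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def inf_norm(x):
    """Compute the l_infinity norm (maximum absolute entry)."""
    return max(abs(xi) for xi in x)

# Normalize an arbitrary vector so that ||x||_p = 1, then check ||x||_inf <= 1.
p = 3
raw = [0.5, -2.0, 1.5, 0.25]
x = [xi / p_norm(raw, p) for xi in raw]
assert abs(p_norm(x, p) - 1.0) < 1e-12
assert inf_norm(x) <= 1.0

# Counterexample to the converse: the all-ones vector in R^4.
ones = [1.0] * 4
assert inf_norm(ones) == 1.0
assert abs(p_norm(ones, p) - 4 ** (1.0 / p)) < 1e-12  # N^{1/p}, not 1
```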
Equality of Vector Norms:
Provide an example of a vector $x \in \mathbb{R}^N$ such that $\|x\|_1 = \|x\|_2 = \|x\|_\infty$. Explain why this equality holds for your example.
Solution:
The all-zeros vector satisfies this equality trivially, since all three norms are $0$. More generally, any vector with at most one nonzero entry works: if $x = c\, e_i$ for a standard basis vector $e_i$, then $\|x\|_1 = \|x\|_2 = \|x\|_\infty = |c|$.
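A quick numeric sketch (the nonzero example vector is an arbitrary choice) confirming that vectors with at most one nonzero entry have equal $\ell_1$, $\ell_2$, and $\ell_\infty$ norms:

```python
# Sketch: verify that any vector with at most one nonzero entry has equal
# l_1, l_2, and l_infinity norms (the all-zeros vector is the special case 0).

def norms(x):
    """Return (l_1, l_2, l_infinity) norms of a list of floats."""
    l1 = sum(abs(xi) for xi in x)
    l2 = sum(xi * xi for xi in x) ** 0.5
    linf = max(abs(xi) for xi in x)
    return l1, l2, linf

for x in ([0.0, 0.0, 0.0], [0.0, -3.5, 0.0]):
    l1, l2, linf = norms(x)
    assert abs(l1 - l2) < 1e-12 and abs(l2 - linf) < 1e-12
```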
Cauchy-Schwarz Inequality in Probability Distributions:
Let $p \in \Delta_N$ be a probability distribution. Use the Cauchy-Schwarz inequality to prove that:
$$\|p\|_2 \geq \frac{1}{\sqrt{N}}.$$
Solution:
By Cauchy-Schwarz, $1 = \sum_{i=1}^N p_i \cdot 1 \leq \|p\|_2 \, \|\mathbf{1}\|_2 = \|p\|_2 \sqrt{N}$. Dividing both sides by $\sqrt{N}$ gives the result, with equality exactly when $p$ is the uniform distribution.
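Assuming the inequality to be shown is the standard Cauchy-Schwarz consequence $\|p\|_2 \geq 1/\sqrt{N}$ (equivalently $\sum_i p_i^2 \geq 1/N$), a quick numeric sketch over random distributions:

```python
import random

# Sketch: numeric check of ||p||_2 >= 1/sqrt(N) for probability distributions
# (the stated bound is an assumption about the truncated statement above).
# Random distributions are drawn by normalizing positive weights.

random.seed(1)
N = 8
for _ in range(100):
    w = [random.random() for _ in range(N)]
    total = sum(w)
    p = [wi / total for wi in w]          # p lies in the simplex Delta_N
    l2 = sum(pi * pi for pi in p) ** 0.5
    assert l2 >= 1.0 / N ** 0.5 - 1e-12

# Equality holds for the uniform distribution.
uniform = [1.0 / N] * N
l2_uniform = sum(pi * pi for pi in uniform) ** 0.5
assert abs(l2_uniform - 1.0 / N ** 0.5) < 1e-12
```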
Strong Convexity:
Show that if a differentiable function $f$ is convex and $\lambda > 0$, then the function $g(x) := f(x) + \frac \lambda 2 \|x\|^2$ is $\lambda$-strongly convex.
Solution:
Since $f$ is differentiable and convex, we have that for any $x, y \in \text{dom}(f)$,
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle. \qquad (*)$$
Now,
$$g(y) = f(y) + \frac{\lambda}{2}\|y\|^2 \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|y\|^2 = g(x) + \langle \nabla g(x), y - x \rangle + \frac{\lambda}{2}\|y - x\|^2.$$
The inequality follows by simply using (*), and the final equality comes from rearranging terms: since $\nabla g(x) = \nabla f(x) + \lambda x$, one checks that $\frac{\lambda}{2}\|y\|^2 = \frac{\lambda}{2}\|x\|^2 + \lambda \langle x, y - x \rangle + \frac{\lambda}{2}\|y - x\|^2$. We are now done, since the above inequality is the definition of $\lambda$-strong convexity of $g$.
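As a sanity check, the strong-convexity inequality can be verified numerically in one dimension; the choice $f(x) = e^x$ and $\lambda = 0.7$ below is an illustrative assumption, not part of the problem:

```python
import math

# Sketch: numerically check the lambda-strong-convexity inequality
#   g(y) >= g(x) + g'(x)(y - x) + (lam/2)(y - x)^2
# for the 1-D choice f(x) = e^x (convex) and g(x) = f(x) + (lam/2) x^2.
# The function choice and lam = 0.7 are illustrative assumptions.

lam = 0.7
f = math.exp                          # convex, differentiable
g = lambda x: f(x) + 0.5 * lam * x * x
g_prime = lambda x: f(x) + lam * x    # g'(x) = f'(x) + lam*x, and f' = f here

grid = [i / 10.0 for i in range(-30, 31)]
for x in grid:
    for y in grid:
        lhs = g(y)
        rhs = g(x) + g_prime(x) * (y - x) + 0.5 * lam * (y - x) ** 2
        assert lhs >= rhs - 1e-9
```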
Fenchel Conjugate of Exponential Function:
Given a convex function $f(x) = e^x$, determine its Fenchel conjugate $f^*(y)$.
Solution:
The Fenchel conjugate of $f$ is given by $f^*(y) = \sup_{x \in \text{dom}(f)} \{yx - f(x)\}$. In this case, $f^*(y) = \sup_{x} \{yx - e^x\}$. For $y > 0$, setting the derivative of the inner function to zero gives the maximizing $x = \ln y$ and subsequently $f^*(y) = y \ln y - y$. (For $y = 0$ the supremum is $0$, and for $y < 0$ it is $+\infty$.)
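A quick numeric check of the conjugate (grid search over $x$; the grid range and step count are arbitrary choices):

```python
import math

# Sketch: numerically verify f*(y) = y ln y - y for f(x) = e^x by maximizing
# y*x - e^x over a fine grid of x values and comparing with the closed form.
# Grid bounds and resolution are arbitrary choices.

def conjugate_numeric(y, lo=-10.0, hi=10.0, steps=20000):
    """Grid-search approximation of sup_x { y*x - e^x }."""
    best = -float("inf")
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        best = max(best, y * x - math.exp(x))
    return best

for y in (0.5, 1.0, 2.0, 5.0):
    closed_form = y * math.log(y) - y
    assert abs(conjugate_numeric(y) - closed_form) < 1e-3
```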
Modified Random Walk Probability:
In a classic random walk, where on each time period you take one step forward or backward with equal probability, it is a well-known fact that after $n$ time periods you typically end up about $\sqrt{n}$ steps away from where you started. Consider a modified random walk where on each time period $i$, you take a step forward or backward with equal probability, and the step size is $i^{-1/2}$. Let $X_i$ be the random variable representing the step taken at time $i$, with $X_i = i^{-1/2}$ or $X_i = -i^{-1/2}$ with equal probability. Show that after $n$ time periods, the probability that the random walk ends up more than $10\sqrt{\log n}$ away from the start is less than 1 in a million.
Solution:
We denote the sum of steps by $S_n = \sum_i X_i$. The expected position after $n$ steps is $E[S_n] = 0$ since the walk is symmetric. We want to bound the probability:
$$P(|S_n| > 10\sqrt{\log n}).$$
By Hoeffding's inequality, for independent bounded random variables $X_i$ with $a_i \leq X_i \leq b_i$, we have:
$$P(|S_n - E[S_n]| > t) \leq 2\exp\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
For our random walk, $a_i = -i^{-1/2}$ and $b_i = i^{-1/2}$, hence $(b_i - a_i)^2 = 4i^{-1}$. The sum $\sum_{i=1}^n (b_i - a_i)^2$ then becomes $4\sum_{i=1}^n i^{-1}$, which is approximately $4\log n$ for large $n$. Setting $t = 10\sqrt{\log n}$, we get:
$$P(|S_n| > 10\sqrt{\log n}) \leq 2\exp\left(\frac{-2 \cdot 100 \log n}{4 \log n}\right) = 2\exp(-50).$$
Since $2\exp(-50)$ is much less than $1/1,000,000$, we conclude that the probability of the random walk being more than $10\sqrt{\log n}$ steps away from the start after $n$ periods is less than 1 in a million.
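A Monte Carlo sketch of the modified walk (the trial count and choice of $n$ are arbitrary) is consistent with the bound: the threshold $10\sqrt{\log n}$ sits roughly ten standard deviations out, and no simulated walk comes close to it.

```python
import math
import random

# Sketch: Monte Carlo check of the modified random walk. With steps of size
# i^{-1/2}, Var(S_n) = sum_{i<=n} 1/i ~ log n, so the threshold 10*sqrt(log n)
# is about 10 standard deviations out and is essentially never exceeded.
# n and the trial count are arbitrary illustrative choices.

random.seed(0)
n = 2000
threshold = 10.0 * math.sqrt(math.log(n))

def walk(n):
    """Simulate S_n = sum_i X_i with X_i = +/- i^{-1/2} with equal probability."""
    s = 0.0
    for i in range(1, n + 1):
        s += random.choice((1.0, -1.0)) * i ** -0.5
    return s

trials = 200
exceed = sum(abs(walk(n)) > threshold for _ in range(trials))
assert exceed == 0  # consistent with the Hoeffding bound 2*exp(-50)

# The bound itself is comfortably below one in a million:
assert 2.0 * math.exp(-50) < 1e-6
```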
Classification with Abstention:
Let $\mathcal{X}$ be an input space and $\mathcal{D}$ be a distribution on $\mathcal{X}\times \{-1,1\}$. Recall that $\eta(x)=\Pr[Y=1|X=x]$.
In some situations, it makes sense for a classifier to abstain (i.e., predict "I don't know") on instances $x\in\mathcal{X}$ for which it is unsure. Suppose we evaluate the classifier as follows:
If the classifier makes a prediction (either $-1$ or $1$), it incurs a cost of $0$ if it is correct and a cost of $1$ if it is incorrect.
If the classifier chooses to abstain, it incurs a cost of $\theta\in [0,\frac{1}{2}]$.
Which classifier $h:\mathcal{X}\to\{-1,1,\text{ abstain}\}$ has the lowest expected error for this problem with respect to $\mathcal{D}$? Write $h(x)$ as a function of $\eta$ and $\theta$.
Solution:
For a given $x$, the expected cost incurred by a classifier is $\theta$ if it abstains, $\eta(x)$ if the classifier predicts $-1$, and $1-\eta(x)$ if the classifier predicts $1$. The best classifier should choose the option which incurs the minimum of these three costs for each $x$. Hence, it should abstain when $\theta=\min(\theta, \eta(x), 1-\eta(x))$. Therefore,
$$h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1 - \theta, \\ -1 & \text{if } \eta(x) \leq \theta, \\ \text{abstain} & \text{otherwise.} \end{cases}$$
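The minimum-cost rule above can be sketched directly in code; `h` below takes $\eta(x)$ and $\theta$ as numbers and returns the cheapest of the three options (the specific test values are arbitrary):

```python
# Sketch: implement the minimum-expected-cost rule described above. For each x,
# compare the three costs eta(x) (predict -1), 1 - eta(x) (predict +1), and
# theta (abstain), and pick the cheapest option.

def h(eta_x, theta):
    """Return -1, 1, or 'abstain', minimizing the expected cost at a point."""
    costs = {-1: eta_x, 1: 1.0 - eta_x, "abstain": theta}
    return min(costs, key=costs.get)

theta = 0.3
assert h(0.1, theta) == -1         # eta small: predicting -1 costs only 0.1
assert h(0.9, theta) == 1          # eta large: predicting +1 costs only 0.1
assert h(0.5, theta) == "abstain"  # uncertain region: abstaining costs 0.3
```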