CS7545_Sp24_Lecture_12 - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Tyler LaBonte

Notes for Lecture 12

February 22, 2024

Scribes: Jonathan Zheng

Last lecture, given a function class $\mathcal{F}$, we saw that with probability $1 - \delta$, for all $f \in \mathcal{F}$,

$$ L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$

Interestingly, this gives a theoretical basis for regularization: minimizing some other term in addition to loss function by minimizing Rademacher bound.

Consider a fixed $m, \delta$. We vary the function class $\mathcal{F}$.

$$L(f) \leq \widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F}) + \text{constant}$$

Choose $\mathcal{F}$ to minimize $\widehat{L}_S(f) + 2\mathfrak{R}(\mathcal{F})$. We will be minimizing the upper bound on generalization error.

Example: $\mathcal{F} = \{ x \mapsto \langle w, x \rangle : |w|_2 \leq B, | x |_2 \leq C \}$. Then,

$$\widehat{R}_S(\mathcal{F}) \leq \frac{BC}{\sqrt{m}} $$

$$L(f) \leq \widehat{L}_S(f) + \frac{2BC}{\sqrt{m}}$$

Fix the values $C, m$. How should we minimize this upper bound on $L(f)$?

We use $l_2$-regularization and design a loss function $l(w, s) = \widehat{L}_S(w) + \lambda \| w\|_2^2, \lambda > 0$. This is the SVM.

VC-Dimension

For a hypothesis class $\mathcal{H}$ and sample $S$ of $m$ points, the restriction $\mathcal{H|}_S = \{(h(x_1), \ldots, h(x_m)): h \in \mathcal{H}\}$. (Total # of ways to classify $S$). We noticed that while $|\mathcal{H}|$ may be $\infty$, $|\mathcal{H|}_S| \leq 2^m.$

We define the growth function of $\mathcal{H}$ as the max number of ways to classify $m$ points: $$\Pi_{\mathcal{H}}(m) = \max_{S \subseteq X, |S| = m} |\mathcal{H|}_S|$$

The VC Dimension of $\mathcal{H}$ is:

$$VC(\mathcal{H}) = \max{m: \Pi_{\mathcal{H}}(m) = 2^m}$$

$S$ is shattered if $|\mathcal{H|}_S| = 2^m$.

Note: if instead of $h: X \rightarrow \{-1, 1\}$, we had $h: X \rightarrow Y$, $|Y| < \infty$, then replace all $2^m$ with $|Y|^m$.

Important. To prove $VC(\mathcal{H}) = d$:

  1. $\exists$ a set of size $d$ which is shattered by $\mathcal{H}$.

  2. $\nexists$ a set of size $d + 1$ which is shattered by $\mathcal{H}$.

Examples:

  1. $\mathcal{H}$ with $|\mathcal{H}| < \infty$.

We want to shatter $r$ points. How many hypotheses are needed?

We need at least $2^r$ different hypotheses as there are $2^r$ possible classifications of $r$ points.

Thus, $VC(\mathcal{H}) \leq \log | \mathcal{H} |$

  1. $\mathcal{H} = \{\text{Intervals in } \mathbb{R}\}$. Interval is $[a, b]$ for $a, b \in \mathbb{R}$, $x = \mathbb{R}$.

$$h(x) = \begin{cases} +1, \text{if } x\in [a, b] \\ -1, \text{else. } \end{cases}$$

$\exists$ a set of size $1$ which is shattered? Yes! (interval includes the point or excludes the point.)

$\exists$ a set of size $2$ which is shattered? Yes! (interval includes both points, excludes both points, and 2 cases where includes one point but excludes the other.)

No set of size $3$ is shattered. Along the number line, a set of $3$ points with labels $+1, -1, +1$ cannot occur because you cannot use multiple intervals. Thus, $VC(\mathcal{H}) = 2$.

  1. $\mathcal{H} = \{\text{axis-aligned rectangles in } \mathbb{R}^2 \}$ For some $a, b, c, d \in \mathbb{R}$.

$$h(x) = \begin{cases} +1, \text{if } x\in [a, b] \times [c, d]\\ -1, \text{else. } \end{cases}$$

$\exists$ a set of $3$ shattered points? Yes! Can shatter $3$ points as long as they are not colinear (A set of $3$ points that form a triangle can be classified all $8$ ways).

$\exists$ a set of $4$ shattered points? Yes! (choose a rhombus that is not aligned with an axis).

The convex hull is the smallest convex set containing all points. We prove that no $5$ points can be shattered.

Claim: a point inside the convex hull cannot be made $-1$ if others are $+1$ or if all points are on the convex hull. Additionally, you cannot make one of them $-1$ and others $+1$.

Thus, $VC(\mathcal{H}) = 4$.

What does $VC(\mathcal{H}) = \infty$ mean? It means that hypotheses in the class can shatter any number of points.

Example:

  1. $\mathcal{H} = \{ \text{all convex polygons in } \mathbb{R}^2\}$.

Put all points on a circle and create a polygon where the number of sides is $\max\{\text{number of +1's}, \text{number of -1's}\}$. This arrangement can shatter any number of points.

  1. $\mathcal{H} = \{ x \mapsto \text{sgn}(\sin(\alpha x)): \alpha \in \mathbb{R}\}$.

$VC(\mathcal{H}) = \infty$. We can adjust the frequency to shatter any number of points.

This is interesting! We only have $1$ parameter $\alpha$ in the class, yet the complexity of the class with VC dimension is $\infty.$

Homework

  1. Graded. What is the VC-dimension of a union of $k$ intervals on the real line?

  2. Graded. What is the VC-dimension of axis-aligned hyperrectangles in $\mathbb{R}^n$? An axis-aligned hyperrectangle $A$ in $\mathbb{R}^n$ is defined by $A=[x_1,y_1]\times \cdots \times [x_n, y_n]$ for $x_1,y_1,\dots,x_n,y_n\in\mathbb{R}$.

⚠️ **GitHub.com Fallback** ⚠️