CS7545_Sp23_Lecture_11: Introduction to VC Dimensions - mltheory/CS7545 GitHub Wiki

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler Labonte

Notes for Lecture 11

February 14, 2023

Scribes: Tianle Huang, Joshua Hsu

Finishing Last Lecture's Proof

The first part of this lecture was dedicated to proving the following theorem:

Theorem 2: Generalization bounds based on Rademacher Complexity

Suppose $\mathcal F \subseteq \{f : \mathcal X \rightarrow [0, 1]\}$. Then, for any $0 < \delta < 1$, with probability $\geq 1-\delta$ over the sample $S\sim D^m$, for all $f\in \mathcal F$ we have that:

  1. $L(f) \leq \hat L_S(f) + 2 \mathfrak R (\mathcal F) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$
  2. $L(f) \leq \hat L_S(f) + 2 \hat{\mathfrak R}_S (\mathcal F) + 3\sqrt{\frac{\log \frac{1}{\delta}}{2m}}$

We need to utilize McDiarmid's Inequality in this proof:

McDiarmid's Inequality (aka bounded differences)

Suppose $g: \mathcal X^m \rightarrow \mathbb{R}$ and there exist constants $C_i$, $1 \leq i \leq m$, such that for all $s, s'$ differing only in the $i$-th component,

$$|g(s) - g(s')| \leq C_i $$

What does it mean?

Suppose,

$$s = (x_1, x_2, ..., x_i, x_{i+1}, ..., x_m) $$

$$s' = (x_1, x_2, ..., {x_i}', x_{i+1}, ..., x_m) $$

That is, if we change a single component of the input, the value of $g$ changes by at most $C_i$.

Then, $\forall \epsilon \geq 0$,

$$\Pr_{S \sim D^m} \left[ |g(S) - \mathbb{E}_{S \sim D^m}[g(S)]| \geq \epsilon \right] \leq 2 \exp\left(\frac{-2\epsilon^2}{\sum\limits_{i=1}^{m} C_i^2}\right)$$
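To see the inequality in action, here is a small simulation (illustrative, not from the lecture): for the empirical mean of i.i.d. $[0,1]$-valued samples, changing one point moves the mean by at most $C_i = 1/m$, so the bound becomes $2\exp(-2m\epsilon^2)$.

```python
import math
import random

def mcdiarmid_bound(m, c, eps):
    """Two-sided McDiarmid bound when every C_i equals the same constant c."""
    return 2 * math.exp(-2 * eps**2 / (m * c**2))

# Empirical mean of m i.i.d. Uniform[0,1] draws: C_i = 1/m for each i.
random.seed(0)
m, eps, trials = 200, 0.1, 2000
true_mean = 0.5  # expectation of the empirical mean
deviations = 0
for _ in range(trials):
    s = [random.random() for _ in range(m)]
    if abs(sum(s) / m - true_mean) >= eps:
        deviations += 1

empirical = deviations / trials
bound = mcdiarmid_bound(m, 1 / m, eps)  # = 2*exp(-2*m*eps^2)
assert empirical <= bound
```

The empirical deviation frequency is far below the bound here, which is expected: McDiarmid is a worst-case tail bound.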

Going back to the original proof:

Let $g(S) := \sup_{f\in \mathcal F} [L(f) - \hat L_S(f)]$, and let $S'$ differ from $S$ only in the $i$-th component. Then,

$$ \begin{align} g(S') - g(S) &= \sup_{f\in \mathcal F} [L(f) - \hat L_{S'}(f)] - \sup_{f\in \mathcal F} [L(f) - \hat L_{S}(f)] \\ &\leq \sup_{f\in \mathcal F} [L(f) - \hat L_{S'}(f) - (L(f) - \hat L_{S}(f))] & \text{by subadditivity of $\sup$} \\ &= \sup_{f\in \mathcal F} [\hat L_{S}(f) - \hat L_{S'}(f) ] \\ &= \sup_{f\in \mathcal F} \left [ \frac{f(x_i) - f(x_i')}{m} \right ] &\text{since $S$ and $S'$ differ only in the $i^{th}$ component}\\ &\leq \frac{1}{m} & \text{since $f(x) \in [0,1]$} \end{align} $$
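The bounded-difference property just derived can be checked numerically on a toy example (everything here is an illustrative assumption: a small finite function class on a finite domain, with $L(f)$ the mean of $f$ under the uniform distribution):

```python
import random

random.seed(1)
domain = list(range(10))
# 20 random functions f: {0,...,9} -> [0,1], each stored as a list of values.
F = [[random.random() for _ in domain] for _ in range(20)]

def L(f):
    return sum(f) / len(f)                  # true risk under uniform D

def L_hat(f, S):
    return sum(f[x] for x in S) / len(S)    # empirical risk on sample S

def g(S):
    return max(L(f) - L_hat(f, S) for f in F)

# Replacing any single sample point changes g by at most 1/m.
m = 8
S = [random.choice(domain) for _ in range(m)]
for i in range(m):
    for x_new in domain:
        S_prime = S[:i] + [x_new] + S[i + 1:]
        assert abs(g(S) - g(S_prime)) <= 1 / m + 1e-12
```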

Thus $g$ satisfies the bounded-differences condition with $C_i = 1/m$ (the same argument with $S$ and $S'$ swapped gives the other direction). From the one-sided version of McDiarmid's inequality, we can write the following:

$$\Pr_{S \sim D^m}\left[g(S)-\mathbb{E}_{S \sim D^m} [g(S)] \geq \epsilon\right] \leq \exp(-2m\epsilon^2)$$

Now if we consider:

$$\exp(-2m\epsilon^2)=\delta$$

And solve for $\epsilon$:

$$\epsilon=\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$
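As a quick numerical aside (not part of the proof), this confidence term is easy to compute and shrinks at the usual $O(1/\sqrt{m})$ rate:

```python
import math

def confidence_term(m, delta):
    """Solve exp(-2*m*eps^2) = delta for eps >= 0."""
    return math.sqrt(math.log(1 / delta) / (2 * m))

# With m = 1000 samples and delta = 0.05:
eps = confidence_term(1000, 0.05)
assert math.isclose(eps, math.sqrt(math.log(20) / 2000))
# Quadrupling the sample size halves the term.
assert math.isclose(confidence_term(4000, 0.05), eps / 2)
```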

Now let's rewrite the result from before in terms of $\delta$:

$$\Pr_{S \sim D^m}\left [g(S)-\mathbb{E}_{S \sim D^m} [g(S)] \geq \sqrt{\frac{\log\frac{1}{\delta}}{2m}}\right ] \leq \delta$$

If we consider the complement, which has probability at least $1-\delta$, we can switch the direction of the inequality inside:

$$\Pr_{S \sim D^m}\left [g(S)-\mathbb{E}_{S \sim D^m} [g(S)] \leq \sqrt{\frac{\log\frac{1}{\delta}}{2m}}\right ] \geq 1-\delta$$

Let $\star$ denote the event inside $\Pr_{S \sim D^m}$ above:

$$\star:\quad g(S)-\mathbb{E}_{S \sim D^m} [g(S)] \leq \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$

For the next portion of this proof, we will be utilizing symmetrization. This was covered in the last lecture, so for more details on symmetrization, please refer to Lecture 10.

Using symmetrization, we can write:

$$\mathbb{E}_{S \sim D^m} \left[\sup_{f \in \mathcal{F}} \left(L(f) - \hat L_S(f)\right)\right] \leq 2\mathfrak R(\mathcal{F})$$

As you can see, the left side of the above inequality is exactly $\mathbb{E}_{S \sim D^m}[g(S)]$.

This means that the following is true:

$$\mathbb{E}_{S \sim D^m}[g(S)]\leq 2\mathfrak R(\mathcal{F})$$

If we use this result in $\star$ while also substituting the definition of $g(S)$, we get, with probability $\geq 1-\delta$:

$$\sup_{f \in \mathcal{F}} \left(L(f) - \hat L_S(f)\right) \leq 2\mathfrak R(\mathcal{F})+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$

Since the left side is a supremum, this bound holds for all $f\in \mathcal F$ simultaneously, so we can drop the supremum, which yields:

$$L(f) \leq \hat L_S(f) + 2 \mathfrak R (\mathcal F) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$

This matches the first inequality in the theorem statement. (The second inequality follows by a similar argument, additionally applying McDiarmid's inequality to relate $\mathfrak R(\mathcal F)$ to $\hat{\mathfrak R}_S(\mathcal F)$.)

VC Dimensions

While Rademacher complexity was invented in the 90's, the VC dimension was invented in the 60's by two Russians, Vapnik and Chervonenkis, hence the name. In a sense, the VC dimension is simpler than Rademacher complexity because it has some sort of "geometric meaning". So if you are already tired of Rademacher complexities, you will probably like VC dimensions.

Besides being two very different ways of measuring the "complexity" of a function class, there are two important distinctions between Rademacher complexity and VC dimension:

  • Rademacher Complexity is data/sample dependent while VC Dimension is not.
  • Rademacher Complexity is hard to compute while VC Dimension is "easier".

Now, let's get into the actual definitions for VC Dimension.

Definition: The restriction of

$$\mathcal H \subseteq \left\{h: \mathcal X \rightarrow \{-1, 1\}\right\}$$

to a set

$$S = \{x_1, \dots, x_m\}$$

is defined as

$$\left.\mathcal H\right|_S = \left\{(h(x_1), \dots, h(x_m)) : h \in \mathcal H\right\}$$

or in other words, all possible labelings of $S$ via $\mathcal H$.

Here is one example: suppose $m = 2$, that is, we have 2 points, and suppose $\mathcal H$ is such that $h(x_1)$ is always $-1$ while $h(x_2)$ can be $-1$ or $1$. Then the restriction is

$$\left.\mathcal H\right|_S = \{(-1, 1), (-1, -1)\}$$

Now, the astute among you may say, "Wait a minute, this is still data dependent! Didn't we say VC Dimension is data independent?" Well, you are correct, we will make it independent momentarily.

Definition: The growth function $\Pi_{\mathcal H} : \mathbb N \rightarrow \mathbb N$ (Where $\mathbb N$ is the set of natural numbers, including 0) is defined as:

$$\Pi_{\mathcal H}(m) = \underset{S \subseteq \mathcal X, |S| = m}{\max}\big|\left.\mathcal H\right|_S\big|$$

i.e. the maximum size of the restriction of the function class over all possible data sets of size $m$. Note that by taking the maximum, we have removed the data dependency.

Also note $\Pi_{\mathcal H}(m) \le 2^m$ as even if every labeling is possible, each data point can only be labeled as $-1$ or $1$. Since there are $m$ data points, we can have at maximum $2^m$ labelings.
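Both definitions are easy to compute by brute force for simple classes. The sketch below (my own illustration, not from the lecture) uses threshold functions $h_t(x) = +1$ iff $x \geq t$ as an example class; $m$ distinct points admit exactly $m+1$ distinct labelings, well below $2^m$:

```python
def restriction(hypotheses, S):
    """H|_S: the set of distinct labelings of S realized by the class."""
    return {tuple(h(x) for x in S) for h in hypotheses}

def threshold_class(S):
    """A finite stand-in for all thresholds on R: one representative
    threshold below, between, and above the points of S."""
    points = sorted(S)
    cuts = ([points[0] - 1]
            + [(a + b) / 2 for a, b in zip(points, points[1:])]
            + [points[-1] + 1])
    return [lambda x, t=t: 1 if x >= t else -1 for t in cuts]

def growth_on(S):
    """|H|_S| for the threshold class on this particular S."""
    return len(restriction(threshold_class(S), S))

# Thresholds realize m + 1 labelings on m distinct points, and m+1 <= 2^m.
S = [0.5, 1.7, 3.0, 4.2]
assert growth_on(S) == len(S) + 1 <= 2 ** len(S)
```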

Whenever we have something nice happen in mathematics, we will name it, which leads to the next definition.

Definition: A set $S$ of $m \ge 1$ points is shattered by $\mathcal H$ if

$$\big|\left.\mathcal H\right|_S\big| = 2^m$$

or equivalently (since $\Pi_{\mathcal H}(m) = 2^m$ holds exactly when some set of size $m$ is shattered)

$$\Pi_{\mathcal H}(m) = 2^m$$

In other words, a set of points is shattered if every possible labeling of it is achievable with the function class.

Now, we can finally introduce the definition of VC Dimension.

Definition: The VC dimension of a function class $\mathcal H$ is the size of the largest set which is shattered by $\mathcal H$. Or, in mathematical notation:

$$\text{VC-dim}(\mathcal H) = \text{max}\left\{m: \Pi_{\mathcal H}(m) = 2^m\right\}$$

Important: For $\text{VC-dim}(\mathcal H) = d$, the following two things must both happen:

  1. There exists some set of $d$ points shattered by $\mathcal H$, AND
  2. There does not exist any set of $d + 1$ points shattered by $\mathcal H$.

(Note that if no set of $d+1$ points is shattered by $\mathcal H$, then no set $A$ of $\geq d+1$ points can be shattered by $\mathcal H$ either: otherwise, removing points from $A$ until only $d+1$ remain would give a set of $d+1$ points shattered by $\mathcal H$, a contradiction.)
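For a finite class on a finite domain, this definition can be evaluated directly; the sketch below (an illustration under those assumptions, not from the lecture) uses the note above to stop early once no set of some size is shattered:

```python
from itertools import combinations

def vc_dim(H, domain):
    """Brute-force VC dimension of a finite class H over a finite domain.
    Each h in H is indexable by domain points and returns labels in {-1, +1}."""
    d = 0
    for m in range(1, len(domain) + 1):
        shatters_some_set = any(
            len({tuple(h[x] for x in S) for h in H}) == 2 ** m
            for S in combinations(domain, m))
        if shatters_some_set:
            d = m
        else:
            break  # no set of size m shattered => none of size > m either
    return d

# Hypothetical example: thresholds h_t(x) = +1 iff x >= t, on {0,...,4}.
domain = list(range(5))
H = [[1 if x >= t else -1 for x in domain] for t in range(6)]
assert vc_dim(H, domain) == 1  # thresholds cannot realize (+1, -1)
```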

Now let's see some examples.

Example 1:

Let function class $\mathcal H$ be defined as:

$$\mathcal H = \left\{h = \begin{cases} \text{+}1 & \text{if } x\in [a, b] \\\ \text{−}1 & \text{otherwise} \end{cases}: a, b \in \mathbb R\right\}$$

Or in other words, $\mathcal H$ is all closed intervals in $\mathbb R$.

*(figure: a closed interval $[a, b]$ on the real line)*

Now, let's see if $\mathcal H$ can shatter 1 point. For some point, we can either include it in the interval, or we don't. Therefore, it can indeed shatter 1 point. To see it graphically:

*(figure: intervals realizing both labelings of a single point)*

Can it shatter 2 points? Well yes!

*(figure: intervals realizing all four labelings of two points)*

Note that, for the same set of points, we must check that every combination of labels is achievable before we declare the set shattered.

Now, let's look at the case with 3 points. If we try to check every combination as above, we soon run into trouble: specifically, with the labeling $(+1, -1, +1)$ for the points from left to right. The interval must start before the first point so that it is labeled $+1$, and it must end before the second point to assign it $-1$. But then the third point can no longer be included in the interval.

*(figure: no interval achieves the labeling $(+1, -1, +1)$ on three points)*

Since we can shatter 2 points but not 3, the VC dimension of $\mathcal H$ is 2.
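This example can be verified by brute force (my own sketch, not from the lecture): an interval can realize a labeling exactly when the $+1$ points form a contiguous block in sorted order.

```python
from itertools import product

def interval_realizable(points, labels):
    """Can some closed interval [a, b] label exactly the +1 points?
    Equivalent to: the +1 points are contiguous in sorted order."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    pos = [i for i, idx in enumerate(order) if labels[idx] == 1]
    return not pos or pos[-1] - pos[0] + 1 == len(pos)

def shattered(points):
    """Check that every labeling in {-1, +1}^m is realizable."""
    return all(interval_realizable(points, labs)
               for labs in product([-1, 1], repeat=len(points)))

assert shattered([3.0])                  # 1 point: both labelings work
assert shattered([1.0, 2.0])             # 2 points: all four labelings work
assert not shattered([1.0, 2.0, 3.0])    # (+1, -1, +1) is impossible
```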

Example 2:

For the second example, let's consider the function class $\mathcal H$ defined as follows:

$$\mathcal H = \left\{h = \begin{cases} \text{+}1 & \text{if } x\in [a, b] \text{ and } y\in [c, d] \\\ \text{−}1 & \text{otherwise} \end{cases}: a, b, c, d \in \mathbb R\right\}$$

In other words, it represents all rectangular regions in $\mathbb R^2$.

*(figure: an axis-aligned rectangle in $\mathbb R^2$)*

Now, can this new function class shatter 3 points? Well, for example, if we line up those 3 points on the same "level", we will soon run into the same issue as with the $\mathbb R$ interval case:

*(figure: three collinear points cannot be shattered by rectangles)*

Does that mean the VC dimension is also 2? Well, no. We don't need to shatter every set of $d$ points; we just need to shatter some one set.

The following, for example, works:

*(figure: three non-collinear points shattered by rectangles)*

You can check that it is indeed possible to get every combination. (Spoiler alert: it is.)

How about 4 points?

Actually, yes! It is possible, as long as you arrange those 4 points in a "cross formation":

*(figure: four points in a cross formation)*
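This, too, can be checked by brute force (again my own sketch): a rectangle can realize a labeling exactly when the bounding box of the $+1$ points contains no $-1$ point.

```python
from itertools import product

def rect_realizable(points, labels):
    """Can some axis-aligned rectangle contain exactly the +1 points?
    Take the bounding box of the +1 points; no -1 point may fall inside."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # a tiny rectangle away from all points labels everything -1
    a, b = min(x for x, _ in pos), max(x for x, _ in pos)
    c, d = min(y for _, y in pos), max(y for _, y in pos)
    return all(not (a <= x <= b and c <= y <= d)
               for (x, y), lab in zip(points, labels) if lab == -1)

def shattered(points):
    """Check that every labeling in {-1, +1}^m is realizable."""
    return all(rect_realizable(points, labs)
               for labs in product([-1, 1], repeat=len(points)))

# The "cross formation" from the lecture: one point per side of a diamond.
cross = [(0, 1), (0, -1), (1, 0), (-1, 0)]
assert shattered(cross)

# Three points on the same level fail, as in the earlier figure:
assert not shattered([(0, 0), (1, 0), (2, 0)])
```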

How about 5 points?

Well, you have to wait until next lecture to find out. :)
