CS7545_Sp24_Lecture_13: VC‐Dimension Contd. & Sauer‐Shelah Lemma

CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Tyler LaBonte

Notes for Lecture 13: VC‐Dimension Contd. & Sauer‐Shelah Lemma

February 27, 2024

Scribes: Raj Janardhan, Anshul Mittal

Next Tuesday, we have an exam!

Quick Refresher:

$$\mathcal{H}|_S = \{ (h(x_1), \ldots, h(x_m)) : h \in \mathcal{H} \}$$

$$\Pi_{\mathcal{H}}(m) = \max_{S \subset X, |S| = m} |\mathcal{H}|_S|$$

$$\Pi_{\mathcal{H}}(m) \leq 2^{m}$$

This is for binary classification. For multiclass classification, replace the 2 with the number of classes.

$$|\mathcal{H}|_S| = 2^{m}$$

When the above holds, we say the set $S$ is shattered by $\mathcal{H}$.

$VC(\mathcal H)$ is the maximum value $d$ for which there exists a sample of size $d$ that is shattered by $\mathcal{H}$.
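The definitions above can be checked by brute force for small, finite hypothesis classes. The sketch below is not from the lecture; the helpers `restriction`, `is_shattered`, `growth_function`, and `vc_dimension`, and the 1-D threshold class used as a demo, are hypothetical illustrations. It implements $\mathcal{H}|_S$, the shattering test $|\mathcal{H}|_S| = 2^m$, the growth function, and a brute-force VC-dimension search over a candidate pool of points.

```python
from itertools import combinations


def restriction(hypotheses, S):
    """H|_S: the set of distinct label vectors that H induces on the sample S."""
    return {tuple(h(x) for x in S) for h in hypotheses}


def is_shattered(hypotheses, S):
    """S is shattered iff H realizes all 2^|S| labelings of S."""
    return len(restriction(hypotheses, S)) == 2 ** len(S)


def growth_function(hypotheses, X, m):
    """Pi_H(m): the max number of labelings over all size-m samples drawn from X."""
    return max(len(restriction(hypotheses, S)) for S in combinations(X, m))


def vc_dimension(hypotheses, X):
    """Largest d such that SOME size-d subset of the candidate pool X is shattered."""
    d = 0
    for m in range(1, len(X) + 1):
        if any(is_shattered(hypotheses, S) for S in combinations(X, m)):
            d = m
    return d


# Toy illustration (hypothetical class): 1-D thresholds h_t(x) = +1 iff x >= t.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in (-0.5, 0.5, 1.5, 2.5)]
points = [0.0, 1.0, 2.0]
print(is_shattered(thresholds, (0.0,)))        # True: a single point is shattered
print(is_shattered(thresholds, (0.0, 1.0)))    # False: labeling (+1, -1) is impossible
print(growth_function(thresholds, points, 2))  # 3 < 2^2 = 4
print(vc_dimension(thresholds, points))        # 1
```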

To prove $VC(\mathcal{H}) = d$:

  1. $\exists$ a set of size $d$ which is shattered by $\mathcal{H}$.

  2. $\nexists$ a set of size $d + 1$ which is shattered by $\mathcal{H}$.

Question: For a given hypothesis class, if a set of size $d$ is shattered, does that imply that for every size less than $d$ there exists a set of that size that is shattered?

Answer: Yes!

Suppose we have a sample $x_{1}, x_{2}, x_{3}, \ldots, x_{d}$ of size $d$ that is shattered, meaning every labeling of these points is realized by some hypothesis in $\mathcal{H}$. If we remove a single element, every labeling of the remaining $d-1$ points is still realized (just restrict the hypotheses that realized the labelings of the full sample), so we have a sample of size $d-1$ that is shattered. Repeating this for every value less than $d$, we conclude that if a set of size $d$ is shattered by a given hypothesis class, then there also exist shattered sets of sizes $d-1, d-2, d-3, \ldots$ for the same hypothesis class.
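As a quick sanity check of this argument, the snippet below (a sketch, not lecture code) projects the full set of $2^d$ label vectors onto any $d-1$ of the coordinates and confirms that all $2^{d-1}$ label vectors survive.

```python
from itertools import product

# If S of size d is shattered, all 2^d label vectors occur; dropping any one
# coordinate (removing one point) must leave all 2^(d-1) label vectors.
d = 4
full_labelings = set(product([-1, +1], repeat=d))      # shattered: all 2^d labelings
for drop in range(d):
    projected = {tuple(lab[i] for i in range(d) if i != drop)
                 for lab in full_labelings}
    assert len(projected) == 2 ** (d - 1)              # the smaller set is still shattered
print("projections onto any d-1 points remain shattered")
```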

Half-Space VC-Dimension Example

VC dimension of half-spaces in $\mathbb{R}^{2}$: $h(x) = sgn(w^{T}x + b)$, where $w, x \in \mathbb{R}^{2}$ and $b \in \mathbb{R}$

Part 1: Exhibit a set of size $d$ that is shattered

Any 3 non-collinear points can be shattered. To classify all of them as positive, place the separating boundary below all the points; to classify all of them as negative, place it above all the points. Because the 3 points are non-collinear, we can also draw a line separating any single point from the other 2, which covers the remaining labelings. Since all $2^3 = 8$ classifications are possible, a set of size 3 is shattered by half-spaces in $\mathbb{R}^{2}$.
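Here is a brute-force confirmation for a concrete non-collinear triple (a sketch, not lecture code; the points and the helper `find_halfspace` are my own illustration). For each of the $2^3 = 8$ labelings it runs a simple perceptron with a bias term; the perceptron converges exactly because each labeling is realizable by some half-space $sgn(w^{T}x + b)$.

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]          # three non-collinear points


def find_halfspace(points, labels, max_epochs=10_000):
    """Return (w, b) with label * (w . x + b) > 0 for every point, or None."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        updated = False
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:   # point is misclassified
                w[0] += y * x1
                w[1] += y * x2
                b += y
                updated = True
        if not updated:                                # all points strictly correct
            return w, b
    return None


for labels in product([-1, +1], repeat=3):
    assert find_halfspace(points, labels) is not None, labels
print("all 8 labelings realized => the 3 points are shattered")
```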

Let's make the claim that VC($\mathcal{H}$) = 3.

We now need to prove that no set of size $d + 1 = 4$ is shattered.

Let's take a set of 4 points; can we shatter them?

With 4 points, there are 2 cases:

  • Case 1: All four points are vertices of their convex hull
  • Case 2: At least one point lies inside the convex hull of the other three

Case 1: If all 4 points are vertices of the convex hull, the alternating labeling around the hull (+, -, +, -) cannot be achieved: the segment joining the two positive points and the segment joining the two negative points are the diagonals of the quadrilateral, which cross, so no single line can put the positives on one side and the negatives on the other.

Case 2: If at least one point lies inside the convex hull of the other three, the labeling that assigns +1 to the three outer points and -1 to the interior point cannot be achieved: any half-space containing the three outer points also contains their convex hull, and hence the interior point.

Therefore, no set of size 4 is shattered by this hypothesis class, so we have proven that VC($\mathcal{H}$) = 3.
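For Case 1, we can also certify non-realizability numerically. The sketch below (my own setup, assuming SciPy is available; not lecture code) encodes the alternating labeling of the unit square's corners as the linear feasibility problem $y_i(w^{T}x_i + b) \geq 1$ and lets a linear program report that it is infeasible.

```python
import numpy as np
from scipy.optimize import linprog

# Corners of the unit square with the alternating ("XOR") labeling from Case 1.
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, -1, -1])

# Feasibility LP in (w1, w2, b): require y_i * (w . x_i + b) >= 1 for all i,
# i.e. -y_i * [x_i1, x_i2, 1] . [w1, w2, b] <= -1.  (Any strictly realizable
# labeling can be rescaled to satisfy a margin of 1, so this loses nothing.)
A_ub = -y[:, None] * np.hstack([X, np.ones((4, 1))])
b_ub = -np.ones(4)
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
print(res.success)   # False: no half-space realizes the alternating labeling
```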

Extension: Halfspaces in $\mathbb{R}^{n}$

We will have $n$ points: $e_{i} = (0, ..., 1, ..., 0)$, where the $i^{\text{th}}$ coordinate is 1 and the rest are 0.

Our predictor is still $sgn(w^{T}x + b)$. For any desired labeling, set $w_{i} = y_{i}$, where $y_{i} \in \{-1, +1\}$ is the label of the point $e_i$, and set $b=0$. Then $sgn(w^T e_i) = sgn(y_i) = y_i$. Thus every labeling is achievable on this set of points, so a set of size $n$ is shattered by this hypothesis class.

Let's see if we can shatter $n + 1$ points.

We can take the same $n$ points and add the origin. We keep the same weights $w_i = y_i$ as before, but now add a bias of $b = y_{0}/2$, where $y_{0}$ is the desired label of the origin. Why does this work? For each basis point $e_i$ we have $sgn(w^T e_i + b) = sgn(y_i + y_0/2) = y_i$, because $|y_{0}/2| < 1$ and so the bias never changes the sign; therefore we can achieve every labeling on the $n$ basis points. For the origin we have $sgn(w^T \mathbf{0} + b) = sgn(y_{0}/2) = y_{0}$. We have shown that all labelings are possible with this set of points, so a set of size $n + 1$ is shattered by this hypothesis class.
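A quick numerical check of this construction for a small $n$ (a sketch, not lecture code):

```python
import numpy as np
from itertools import product

n = 3
E = np.eye(n)                                  # rows are the basis points e_1, ..., e_n
origin = np.zeros(n)

for labels in product([-1, +1], repeat=n + 1): # labels = (y_0, y_1, ..., y_n)
    y0, y = labels[0], np.array(labels[1:], dtype=float)
    w, b = y, y0 / 2.0                         # the construction from the notes
    preds = [int(np.sign(w @ origin + b))]
    preds += [int(np.sign(w @ E[i] + b)) for i in range(n)]
    assert preds == list(labels)
print(f"all 2^{n + 1} labelings realized => these {n + 1} points are shattered")
```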

We will now make the claim that VC(Halfspaces in $\mathbb{R}^{n}$) = $n + 1$.

To prove this, we need to show that there does not exist a set of size $n + 2$ that is shattered with the hypothesis class. For this, we will introduce a new theorem.

Radon's Theorem: Any set of $n + 2$ points in $\mathbb{R}^{n}$ can be partitioned into 2 disjoint sets $X_{1}$ and $X_{2}$ such that their convex hulls intersect: $Conv(X_{1}) \cap Conv(X_{2}) \neq \emptyset$.

We can show an example in 2 dimensions. Say we have 4 points at the corners of a square. If we put the top-right and bottom-left corners in one set and the top-left and bottom-right corners in the other, the two convex hulls (the diagonals of the square) intersect.

**Note:** Radon's Theorem states that there exists at least 1 partition such that the convex hulls intersect, NOT that every partition results in intersecting convex hulls.
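A tiny numerical version of the square example (a sketch, not lecture code): the midpoint of each diagonal is the same point $(1/2, 1/2)$, so it lies in both convex hulls.

```python
import numpy as np

X1 = np.array([[0.0, 0.0], [1.0, 1.0]])   # bottom-left and top-right corners
X2 = np.array([[1.0, 0.0], [0.0, 1.0]])   # bottom-right and top-left corners
p1 = 0.5 * X1[0] + 0.5 * X1[1]            # midpoint of the first diagonal
p2 = 0.5 * X2[0] + 0.5 * X2[1]            # midpoint of the second diagonal
print(p1, p2, np.allclose(p1, p2))        # both are (0.5, 0.5): the hulls intersect
```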

We can use this to show that we cannot shatter $n + 2$ points in our halfspace example.

Given: If the points of $X_{1}$ are labeled positive and the points of $X_{2}$ are labeled negative (or vice versa), and a hyperplane realizes this labeling, then that hyperplane separates $Conv(X_{1})$ from $Conv(X_{2})$, so the convex hulls are disjoint.

By Radon's Theorem, there exists a partition $X_{1}, X_{2}$ whose convex hulls intersect. By the fact above, the corresponding labeling (positive on $X_{1}$, negative on $X_{2}$) cannot be realized by any hyperplane, so it cannot be achieved by the hypothesis class. Not all labelings are possible, so no set of size $n+2$ is shattered, and therefore VC($\mathcal{H}$) = $n + 1$.
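The whole argument can be simulated numerically. The sketch below (my own illustration, assuming NumPy and SciPy; not lecture code) finds a Radon partition of $n+2$ random points in $\mathbb{R}^n$ by computing an affine dependence, then uses a linear program to confirm that the corresponding labeling cannot be realized by any half-space.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 4
P = rng.standard_normal((n + 2, n))            # n + 2 points in R^n (general position)

# Radon partition: find a nonzero lambda with sum_i lambda_i * p_i = 0 and
# sum_i lambda_i = 0, then split the points by the sign of lambda_i.
A = np.vstack([P.T, np.ones(n + 2)])           # (n + 1) x (n + 2) linear system
lam = np.linalg.svd(A)[2][-1]                  # a vector in the null space of A
labels = np.where(lam > 0, +1, -1)             # +1 on X_1, -1 on X_2

# Try to realize this labeling with a half-space: y_i * (w . p_i + b) >= 1.
A_ub = -labels[:, None] * np.hstack([P, np.ones((n + 2, 1))])
res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=-np.ones(n + 2),
              bounds=[(None, None)] * (n + 1))
print(res.success)                             # False: the Radon labeling is not realizable
```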

Sauer-Shelah Lemma

Theorem: Suppose VC($\mathcal{H}$) $= d < \infty$. Then for all $m$, $\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$

  • This provides a much better bound for $\Pi_{\mathcal{H}}(m)$ than the trivial $2^{m}$ (recall that $\sum_{i=0}^{m} \binom{m}{i} = 2^m$).

Corollary: Suppose VC($\mathcal{H}$) = $d$. For $m \geq d$, $\Pi_{\mathcal{H}}(m) \leq (\frac{em}{d})^{d} = O(m^d)$

The growth function has 2 behaviors:

  1. If $VC(\mathcal H)=\infty$, then $\Pi_\mathcal H(m) = 2^m$.
  2. If $VC(\mathcal H)=d < \infty$, then $\Pi_\mathcal H(m) = 2^m$ for $m\leq d$, and $\Pi_\mathcal H(m) = O(m^d)$ for $m\geq d$.

Why is this important? It tells us that beyond the VC dimension, the growth function grows only polynomially in $m$ rather than exponentially in $m$. If the VC dimension is infinite, the growth function grows exponentially forever.
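A quick numerical comparison (a sketch, with an arbitrary choice of $d = 5$) makes the two regimes concrete: for $m < d$ the Sauer-Shelah sum equals $2^m$, and for $m \geq d$ it grows only polynomially while $2^m$ explodes.

```python
import math

d = 5
for m in (3, 5, 10, 20, 50):
    sauer = sum(math.comb(m, i) for i in range(min(d, m) + 1))                # Sauer-Shelah sum
    corollary = f"{(math.e * m / d) ** d:.3g}" if m >= d else "n/a (m < d)"   # (em/d)^d bound
    print(f"m={m:3d}   2^m={2 ** m:<16}   sum_i C(m,i)={sauer:<8}   (em/d)^d={corollary}")
```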

Rademacher Complexity and VC-Dimension

Theorem: Uniform Convergence VC Dimension Bound

Let $\mathcal{H} \subseteq \{h: X \rightarrow \{-1, +1\}\}$ with VC($\mathcal{H}$) = $d$, where $d < \infty$.

Then for any $\delta > 0$, with probability $\geq 1 - \delta$ over $S \sim D^{m}$, for all $h \in \mathcal{H}$: $L(h) \leq \hat{L}_{S}(h) + \sqrt{\frac{2d \log(\frac{em}{d})}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}$

The key quantity is $\frac{d}{m}$. To get a good bound, we need $m \gg d$. Since the generalization error is bounded by 1, if $m$ and $d$ are comparable then the upper bound exceeds 1 and gives no meaningful information. This is why we need $m \gg d$.
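To see this numerically, the sketch below (with an arbitrary choice of $d = 100$ and $\delta = 0.05$) evaluates the two added terms of the bound for several sample sizes; when $m \approx d$ the complexity term already exceeds 1, and it only becomes small once $m \gg d$.

```python
import math

d, delta = 100, 0.05
for m in (100, 1_000, 10_000, 1_000_000):
    complexity = math.sqrt(2 * d * math.log(math.e * m / d) / m)   # sqrt(2d log(em/d) / m)
    confidence = math.sqrt(math.log(1 / delta) / (2 * m))          # sqrt(log(1/delta) / 2m)
    print(f"m={m:>9}   complexity term={complexity:.3f}   confidence term={confidence:.4f}")
```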

We can also remove the log factors (doing so is very hard and outside the scope of this class), which leads to:

$$L(h) \leq \hat{L}_{S}(h) + O\left(\sqrt{\frac{d}{m}}\right).$$

Proof:

We use 3 specific Tools:

  1. Rademacher Generalization Bound: for all $f \in \mathcal{F}$, $L(f) \leq \hat{L}_{S}(f) + 2R(\mathcal{F}) + \sqrt{\frac{\log (1/\delta)}{2m}}$
  2. Rademacher-Growth Function Connection (Massart's Lemma): $R(\mathcal{H}) \leq \sqrt{\frac{2 \log(\Pi_{\mathcal{H}}(m))}{m}}$
  3. Sauer-Shelah Lemma (corollary form, for $m \geq d$): $\Pi_{\mathcal{H}}(m) \leq (\frac{em}{d})^{d}$

We can basically plug each of these in and chain them together.

The following holds with probability at least $1 - \delta$:

$$L(h) \leq \hat{L}_{S}(h) + R(\mathcal{H}) + \sqrt{\frac{\log (1/\delta)}{2m}}$$

because for binary classification with 0-1 loss we can replace $2R(\mathcal{F})$ with $R(\mathcal{H})$,

$$L(h) \leq \hat{L}_{S}(h) + \sqrt{\frac{2\log \Pi_{\mathcal{H}}(m)}{m}} + \sqrt{\frac{\log (1/\delta)}{2m}}$$

by plugging in the Rademacher-Growth Function Connection (Massart's Lemma),

$$L(h) \leq \hat{L}_{S}(h) + \sqrt{\frac{2\log \left(\frac{em}{d}\right)^{d}}{m}} + \sqrt{\frac{\log (1/\delta)}{2m}}$$

$$L(h) \leq \hat{L}_{S}(h) + \sqrt{\frac{2d\log \left(\frac{em}{d}\right)}{m}} + \sqrt{\frac{\log (1/\delta)}{2m}}$$

by applying the Sauer-Shelah Lemma and simplifying $\log\left(\frac{em}{d}\right)^{d} = d\log\left(\frac{em}{d}\right)$.

Key Takeaways:

  1. If $VC(\mathcal H)=\infty$, there is no generalization guarantee of this form.
  2. If $VC(\mathcal H)=d < \infty$, there is no meaningful generalization guarantee when $m \leq d$. However, there is a strong guarantee when $m \gg d$.

Why do we study this? State of MLT Research

Rademacher complexity has been in use since the 2000s, and VC-dimension has been known since the 1970s. These tools were created pre-deep learning, and deep learning broke everything. How does it break everything? Take a neural network: its VC-dimension can be computed to be approximately $O(\text{width} \times \text{depth})$, where width is the number of parameters per layer and depth is the number of layers. For networks like GPT, Llama, etc., the VC-dimension is massive (into the quadrillions and beyond). The training sets for these deep models are large, but still smaller than the VC-dimension. So with these DL models we never reach the regime $m \gg d$ where our VC-based generalization bound becomes meaningful, yet these models still show great performance.

This tells us a couple of things. These bounds analyze a worst-case scenario and do not really take the training algorithm into account. For example, some of these deep models have many ERMs (empirical risk minimizers), yet gradient descent manages to find one that is successful, as shown by the models' performance. There are 2 main questions:

  1. How does gradient descent find the best ERM?
  2. How does gradient descent find a solution that shows great generalization?

Homework

  1. Graded. A simplex in $\mathbb{R}^n$ is the intersection of $n+1$ halfspaces (not necessarily bounded). Prove that the VC-dimension of simplices in $\mathbb{R}^n$ is $\mathcal{O}(n^2\log n)$. Hint. Use the Sauer-Shelah lemma and the VC-dimension of halfspaces in $\mathbb{R}^n$.

  2. Challenge, optional, 1 point extra credit. Prove the best lower bound you can on the VC-dimension of simplices in $\mathbb{R}^n$. You will receive the extra credit point if you either (i) prove a lower bound of $\Omega(n)$ and show a reasonable attempt at improving it, or (ii) prove a lower bound better than $\Omega(n)$.
