# CS7545_Sp24_Lecture_13: VC-Dimension Contd. & Sauer-Shelah Lemma
Next Tuesday, we have an exam!
Recall the growth function and VC dimension: $\Pi_{\mathcal{H}}(m) \leq 2^m$, and $VC(\mathcal{H})$ is the largest $m$ for which $\Pi_{\mathcal{H}}(m) = 2^m$. This is for binary classification; you can replace the 2 with the number of classes for multi-class classification.

The above means that there is a set of size $VC(\mathcal{H})$ that can be labeled in every possible way by hypotheses in $\mathcal{H}$, i.e. the set is shattered by $\mathcal{H}$.
To prove that $VC(\mathcal{H}) = d$, we need to show two things:

- $\exists$ a set of size $d$ which is shattered by $\mathcal{H}$.
- $\nexists$ a set of size $d + 1$ which is shattered by $\mathcal{H}$.
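As a quick worked example of this recipe (added for illustration), consider threshold classifiers on the real line, $\mathcal{H} = \{ h_a(x) = \mathrm{sgn}(x - a) : a \in \mathbb{R} \}$:

- A single point $\{x_1\}$ is shattered: choosing $a < x_1$ gives the label $+1$ and $a > x_1$ gives $-1$, so $VC(\mathcal{H}) \geq 1$.
- No set $\{x_1 < x_2\}$ of size 2 is shattered: the labeling $(+1, -1)$ is impossible, since $\mathrm{sgn}(x_1 - a) = +1$ forces $a < x_1 < x_2$ and hence $\mathrm{sgn}(x_2 - a) = +1$ too. Therefore $VC(\mathcal{H}) = 1$.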
Question: For a given hypothesis class, if a set of size $d$ is shattered, are smaller sets also shattered?

Answer: Yes! (Every subset of a shattered set is itself shattered.)

Let's say we have a sample of size $d$ that is shattered. This means that for every labeling of those $d$ points, there is a hypothesis in $\mathcal{H}$ realizing it. Now take any subset of size $k < d$: any labeling of the subset can be extended to a labeling of the full set, some hypothesis realizes that extended labeling, and the same hypothesis realizes the original labeling on the subset. So the subset is shattered as well.
**VC Dimension of Half-Spaces in $\mathbb{R}^{2}$:** $h(x) = \mathrm{sgn}(w^{T}x + b)$, where $w, x \in \mathbb{R}^{2}$ and $b \in \mathbb{R}$.
Part 1: Exhibit a set of size $d$ that is shattered
Any 3 points that are non-collinear can be shattered. If we want to classify them as all positive, we put the separating boundary below all the points, and if we want to classify them as all negative, we put it above all the points. Since the 3 points are non-collinear, we can also draw a line separating any one point from the other 2, which gives all the remaining classifications. Since all 8 classifications are possible, we conclude that for half-spaces in $\mathbb{R}^2$, the VC dimension is at least 3.
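A small computational sanity check of this claim (an added sketch, not from the lecture; the helpers `is_separable` and `is_shattered` are illustrative names): for each of the $2^3 = 8$ labelings of three non-collinear points, we test linear separability by solving a tiny linear feasibility problem with SciPy.

```python
# Sketch (added): brute-force check that 3 non-collinear points in R^2 are
# shattered by halfspaces sgn(w^T x + b), using an LP feasibility test.
import itertools
import numpy as np
from scipy.optimize import linprog

def is_separable(points, labels):
    """True if some (w, b) satisfies y_i (w^T x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Variables are [w_1, ..., w_d, b]; constraint rows encode -y_i (w^T x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # status 0 = feasible (an optimum was found)

def is_shattered(points):
    """Check that every +/-1 labeling of the points is linearly separable."""
    return all(is_separable(points, labels)
               for labels in itertools.product([1, -1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0, 1)]   # three non-collinear points
print(is_shattered(triangle))          # expected: True
```

Requiring a margin $y_i(w^Tx_i + b) \geq 1$ instead of a strict inequality keeps the feasibility problem well posed; any strict separator can be rescaled to satisfy it.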
Let's make the claim that VC(Halfspaces in $\mathbb{R}^2$) $= 3$.

We now need to prove that no set of size 4 can be shattered.
Let's take a set of 4 points; can we shatter them?
With 4 points, there are 2 cases:
- Case 1: All four points lie on the convex hull
- Case 2: At least one point is inside the convex hull
Case 1: If all 4 points are vertices of the convex hull, you cannot realize the alternating classification ($+, -, +, -$) around the hull: the two positive points and the two negative points sit at opposite corners, the segments joining each pair cross, and the crossing point lies in the convex hull of the positive pair and in the convex hull of the negative pair, so it would have to be on both sides of the separating line. If the convex hull has fewer than 4 vertices, we are in Case 2.

Case 2: If we have at least one point inside the convex hull of the others, you cannot label the vertices of the convex hull $+1$ and the point(s) inside it $-1$: a point inside the convex hull of positively labeled points must itself be labeled $+1$ by any halfspace.
Therefore, no set of size 4 is shattered by this hypothesis class, so we have proven that VC(Halfspaces in $\mathbb{R}^2$) $= 3$.
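To make Case 1 concrete, here is a worked instance (added) with the corners of the unit square and the alternating labeling:

$$p_1 = (0,0),\; p_3 = (1,1) \text{ labeled } +1, \qquad p_2 = (1,0),\; p_4 = (0,1) \text{ labeled } -1.$$

If some halfspace realized this labeling, then $w^T p_1 + b > 0$ and $w^T p_3 + b > 0$, so averaging gives $w^T c + b > 0$ at the center $c = \frac{1}{2}(p_1 + p_3) = (\frac{1}{2}, \frac{1}{2})$. But averaging over $p_2$ and $p_4$ gives $w^T c + b < 0$ at the very same point $c$, a contradiction.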
Now consider half-spaces in $\mathbb{R}^n$: we will have $w, x \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$.

Our prediction is still $\mathrm{sgn}(w^{T}x + b)$.

Let's see if we can shatter $n + 1$ points. We can take the same example as with $\mathbb{R}^2$ and generalize it: $n + 1$ points in general position (for example, the origin together with the $n$ standard basis vectors) can be shattered, so VC(Halfspaces in $\mathbb{R}^n$) $\geq n + 1$.
We will now make the claim that VC(Halfspaces in $\mathbb{R}^n$) $= n + 1$.

To prove this, we need to show that there does not exist a set of size $n + 2$ that can be shattered.
Radon's Theorem: Any set of $n + 2$ points in $\mathbb{R}^n$ can be partitioned into two sets $S_1$ and $S_2$ whose convex hulls intersect.
We can show an example in 2 dimensions. Let's say we have 4 points forming a square. If we put the top-right corner and bottom-left corner in one set and the top-left corner and bottom-right corner in the other set, we get intersecting convex hulls (the two diagonals of the square cross at its center).
**Note:** Radon's Theorem states that there exists at least 1 partition such that the convex hulls intersect, NOT that every partition results in intersecting convex hulls.
We can use this to show that we cannot shatter $n + 2$ points.

Given: if a halfspace labels every point of a set as $+1$, then every point in the convex hull of that set is also labeled $+1$ (and likewise for $-1$).

We know by Radon's Theorem that there exists a partition of our $n + 2$ points into $S_1$ and $S_2$ such that their convex hulls intersect. Consider the labeling that is $+1$ on $S_1$ and $-1$ on $S_2$: a point in the intersection of the two convex hulls would have to be labeled both $+1$ and $-1$, which is impossible. So no set of $n + 2$ points is shattered, and therefore VC(Halfspaces in $\mathbb{R}^n$) $= n + 1$.
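Written out, the convexity step used above is (added detail, with the halfspace $\mathrm{sgn}(w^Tx + b)$ as before): if $x = \sum_i \lambda_i p_i$ with $\lambda_i \geq 0$, $\sum_i \lambda_i = 1$, and every $p_i \in S_1$ satisfies $w^T p_i + b > 0$, then

$$w^T x + b = \sum_i \lambda_i \left(w^T p_i + b\right) > 0.$$

Symmetrically, any point in the convex hull of $S_2$ would satisfy $w^T x + b < 0$, so a point lying in both hulls rules out this labeling.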
Theorem (Sauer-Shelah Lemma): Suppose $VC(\mathcal{H}) = d$. Then

$$\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}.$$

- This provides a much better bound for $\Pi_{\mathcal{H}}(m)$ than $2^{m}$; note that summing all the way up to $i = m$ would just recover it, since $\sum_{i=0}^{m} \binom{m}{i} = 2^m$.
Corollary: Suppose $VC(\mathcal{H}) = d$ and $m \geq d$. Then $\Pi_{\mathcal{H}}(m) \leq \left(\frac{em}{d}\right)^{d}$.
The growth function has 2 behaviors:
- If $VC(\mathcal H)=\infty$, then $\Pi_\mathcal H(m) = 2^m$ for all $m$.
- If $VC(\mathcal H)=d < \infty$, then $\Pi_\mathcal H(m) = 2^m$ for $m < d$, and $\Pi_\mathcal H(m) = O(m^d)$ for $m \geq d$.
Why is this important? This tells us that beyond the VC dimension the growth function grows only polynomially as a function of $m$, rather than exponentially.
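A quick numeric comparison (added; the values of $m$ and $d$ are arbitrary) of $2^m$, the Sauer-Shelah bound, and the corollary's $\left(\frac{em}{d}\right)^d$:

```python
# Sketch (added): compare 2^m, the Sauer-Shelah bound, and the (em/d)^d corollary.
from math import comb, e

def sauer_shelah(m, d):
    """Sum_{i=0}^{d} C(m, i): the Sauer-Shelah bound on the growth function."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3  # e.g., the VC dimension of halfspaces in R^2
for m in (5, 20, 100):
    print(m, 2 ** m, sauer_shelah(m, d), (e * m / d) ** d)
# Both bounds grow like m^d, while 2^m grows exponentially in m.
```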
Theorem: Uniform Convergence VC Dimension Bound

Let $\mathcal{H}$ be a hypothesis class with $VC(\mathcal{H}) = d$, and let $S$ be an i.i.d. sample of size $m \geq d$.

Then for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$:

$$L(h) \leq \hat{L}_{S}(h) + 2\sqrt{\frac{2d\log\frac{em}{d}}{m}} + \sqrt{\frac{\log (1/\delta)}{2m}}.$$

The key term is $\sqrt{\frac{d\log(m/d)}{m}}$: the bound is only meaningful once the sample size $m$ is large relative to the VC dimension $d$.

We can also remove the log factors (this is outside the scope of this class and is very hard), which leads us to a bound of the form

$$L(h) \leq \hat{L}_{S}(h) + O\!\left(\sqrt{\frac{d}{m}}\right) + \sqrt{\frac{\log (1/\delta)}{2m}}.$$
Proof:
We use 3 specific Tools:
- Rademacher Generalization Bound: with probability at least $1 - \delta$, for all $f \in \mathcal{F}$, $L(f) \leq \hat{L}_{S}(f) + 2R(\mathcal{F}) + \sqrt{\frac{\log (1/\delta)}{2m}}$
- Rademacher Growth Function Connection by Massart: $R(\mathcal{H}) \leq \sqrt{\frac{2 \log(\Pi_{\mathcal{H}}(m))}{m}}$
- Sauer-Shelah Lemma (corollary form): $\Pi_{\mathcal{H}}(m) \leq \left(\frac{em}{d}\right)^{d}$
We can basically plug each of these in and staple them together. With probability at least $1 - \delta$, for all $h \in \mathcal{H}$:

$$L(h) \leq \hat{L}_{S}(h) + 2R(\mathcal{H}) + \sqrt{\frac{\log (1/\delta)}{2m}} \leq \hat{L}_{S}(h) + 2\sqrt{\frac{2 \log(\Pi_{\mathcal{H}}(m))}{m}} + \sqrt{\frac{\log (1/\delta)}{2m}} \leq \hat{L}_{S}(h) + 2\sqrt{\frac{2 d \log\frac{em}{d}}{m}} + \sqrt{\frac{\log (1/\delta)}{2m}},$$

where the first inequality is the Rademacher generalization bound (this is the step that holds with probability $1 - \delta$), the second replaces $R(\mathcal{H})$ by plugging in Massart's Rademacher growth function connection, and the third applies the Sauer-Shelah lemma via $\log \Pi_{\mathcal{H}}(m) \leq d \log\frac{em}{d}$.
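To get a feel for the numbers, here is an added sketch that evaluates the gap term of the final bound (the choices of $m$, $d$, and $\delta$ are arbitrary):

```python
# Sketch (added): evaluate the gap term  2*sqrt(2 d log(em/d)/m) + sqrt(log(1/delta)/(2m)).
from math import log, sqrt, e

def vc_gap(m, d, delta=0.05):
    """Upper bound on L(h) - L_hat_S(h) from the chain of inequalities above."""
    return 2 * sqrt(2 * d * log(e * m / d) / m) + sqrt(log(1 / delta) / (2 * m))

d = 10  # VC dimension of the hypothesis class
for m in (10, 100, 10_000, 1_000_000):
    print(m, round(vc_gap(m, d), 3))
# The bound is vacuous (larger than 1) for small m and only shrinks once m >> d.
```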
Key Takeaways:
- If $VC(\mathcal H)=\infty$, there is no guarantee of a generalization bound.
- If $VC(\mathcal H)=d$, there is no real guarantee of a generalization bound if $m \leq d$. However, there is a guarantee if $m \gg d$.
Rademacher complexities have been around since the 2000s, and VC dimension has been known since the 1970s. These tools were created pre-deep-learning, and deep learning broke everything. How does it break everything? Let's take a neural network. There are ways to show that VC(NN) is roughly equal to the number of parameters of the network, which for modern deep networks is far larger than the number of training examples.
This tells us a couple of things. These bounds analyze a very worst-case scenario, and they do not really consider the algorithm. For example, for some of these deep models there are many ERMs (empirical risk minimizers). However, gradient descent somehow finds one of these ERMs that is successful, as shown by its performance. There are 2 main questions:
- How does gradient descent find the best ERM?
- How does gradient descent find a solution that shows great generalization?
Homework:

- **Graded.** A simplex in $\mathbb{R}^n$ is the intersection of $n+1$ halfspaces (not necessarily bounded). Prove that the VC-dimension of simplices in $\mathbb{R}^n$ is $\mathcal{O}(n^2\log n)$. *Hint.* Use the Sauer-Shelah lemma and the VC-dimension of halfspaces in $\mathbb{R}^n$.
- **Challenge, optional, 1 point extra credit.** Prove the best lower bound you can on the VC-dimension of simplices in $\mathbb{R}^n$. You will receive the extra credit point if you either (i) prove a lower bound of $\Omega(n)$ and show a reasonable attempt at improving it, or (ii) prove a lower bound better than $\Omega(n)$.