CS7545_Sp23_Lecture_26: VC Dimension of Neural Networks

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler LaBonte

Notes for Lecture 26

April 20, 2023

Scribes: Rishi Banerjee, Anastasiia Alokhina

VC-Dimension of Linear Threshold Nets

Consider an arbitrary neural network whose edges have weights $w_e \in \mathbb{R}$ and whose nodes have biases $\theta_r \in \mathbb{R}$. For any node $r$ that receives inputs $z_1, z_2, \dots, z_k$ along edges $e_1, \dots, e_k$, the output of the node is $$\Phi(\sum_{i=1}^{k} w_{e_i} z_i + \theta_r)$$ where $\Phi(\cdot)$ is an activation function. Assume that the network is a linear threshold network, so that the output of each node is in $\{-1,1\}$.
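To make the node computation concrete, here is a minimal Python sketch of a single linear threshold node; the convention that the threshold fires at $0$ (output $+1$ when the pre-activation is nonnegative) is an assumption for illustration.

```python
import numpy as np

def linear_threshold_node(weights, inputs, bias):
    """Output of one linear threshold node: sign(w . z + theta), taking values in {-1, +1}."""
    pre_activation = np.dot(weights, inputs) + bias
    # Threshold-at-zero convention (assumed for illustration).
    return 1 if pre_activation >= 0 else -1

# Example: a node with 3 inputs.
print(linear_threshold_node(np.array([0.5, -1.0, 2.0]), np.array([1.0, 1.0, 1.0]), 0.25))  # prints 1
```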

Variables to Consider:

Let $u$ be the number of computation nodes (so not including input nodes)

Let $w$ be the number of parameters (weights and biases), i.e., the number of edges plus the number of computation nodes

Let $d$ be the depth (number of layers excluding input) of the neural network
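As a quick worked example of these quantities (the architecture is chosen only for illustration), a fully connected network with $n$ inputs, one hidden layer of $k$ threshold units, and a single output unit has $$u = k + 1, \quad w = \underbrace{nk + k}_{\text{weights}} + \underbrace{k + 1}_{\text{biases}} = nk + 2k + 1, \quad d = 2$$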

Theorem 1:

Let $H$ be the hypothesis class computed by a network in which every node outputs values in $\{-1,1\}$, let $H_i$ be the class of functions computed by node $u_i$, and let $d_i =$ VC-dim($H_i$); for a linear threshold node, $d_i = \deg(u_i) + 1$, where $\deg(u_i)$ is the in-degree of $u_i$. Define $$d = \sum_{i=1}^{u} d_i$$ (this $d$ is not the depth defined above).

Then, for $m \geq \max_{i} d_i$, $$\Pi_{H} (m) \leq \prod_{i=1}^{u} \Pi_{H_i}(m) \leq (\frac{emu}{d})^d$$

Corollary:

For a linear threshold network, $d = \sum_{i=1}^{u} (\deg(u_i) + 1)$ is exactly the number of edges plus the number of computation nodes, i.e., $d = w$. Hence, for $m \geq w$, $$\Pi_{H}(m) \leq (\frac{emu}{w})^w$$ and thus VC-dim($H$) = $O(w\log u)$, matching the bound from last class

Proof of the second inequality:

By the Sauer-Shelah Lemma, since $m \geq \max_i d_i$, each factor satisfies $$\Pi_{H_i}(m) \leq (\frac{em}{d_i})^{d_i}, \quad \text{so} \quad \prod_{i=1}^{u} \Pi_{H_i}(m) \leq \prod_{i=1}^{u} (\frac{em}{d_i})^{d_i}$$

To get rid of the product, take the log of both sides:
$$\log \prod_{i=1}^{u} \Pi_{H_i}(m) \leq \sum_{i=1}^{u} d_i \log (\frac{em}{d_i})$$

Now divide the sum on the right by $d$ and add $\log(\frac{d}{em})$; since $\sum_{i=1}^{u} \frac{d_i}{d} = 1$, this affine rescaling gives:

$$\frac{1}{d} \sum_{i=1}^{u} d_i \log (\frac{em}{d_i}) + \log(\frac{d}{em}) = \sum_{i=1}^{u} \frac{d_i}{d} \log (\frac{em}{d_i}) + \sum_{i=1}^{u} \frac{d_i}{d} \log(\frac{d}{em}) = \sum_{i=1}^{u} \frac{d_i}{d} \log(\frac{d}{d_i})$$

This final expression is the entropy of the distribution $(\frac{d_i}{d})_{i \in [u]}$. Since entropy is maximized by the uniform distribution, it must be true that $$\sum_{i=1}^{u} \frac{d_i}{d} \log(\frac{d}{d_i}) \leq \sum_{i=1}^u \frac{1}{u} \log(u) = \log(u)$$

Undoing the rescaling and applying the entropy bound, $$\sum_{i=1}^{u} d_i \log(\frac{em}{d_i}) = d(\frac{1}{d} \sum_{i=1}^{u} d_i \log (\frac{em}{d_i}) + \log(\frac{d}{em})) - d \log(\frac{d}{em}) \leq d(\log u - \log(\frac{d}{em})) = d \log(\frac{emu}{d})$$

Therefore $$\log \prod_{i=1}^{u} \Pi_{H_i}(m) \leq d \log(\frac{emu}{d})$$ and exponentiating both sides gives $$\Pi_{H}(m) \leq \prod_{i=1}^{u} \Pi_{H_i}(m) \leq (\frac{emu}{d})^d$$
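As a quick sanity check (a small numeric sketch, not part of the lecture), the following Python snippet verifies $\sum_{i=1}^{u} d_i \log(\frac{em}{d_i}) \leq d \log(\frac{emu}{d})$ for randomly drawn $d_i$; the particular choices of $u$, $m$, and the range of the $d_i$ are arbitrary.

```python
import math
import random

random.seed(0)

u = 8                                                 # number of computation nodes
d_list = [random.randint(1, 10) for _ in range(u)]    # per-node VC dimensions d_i
d = sum(d_list)                                       # d = sum_i d_i
m = max(d_list) + 25                                  # any m >= max_i d_i

# Compare the two sides of the logarithmic inequality from the proof.
lhs = sum(di * math.log(math.e * m / di) for di in d_list)
rhs = d * math.log(math.e * m * u / d)
print(f"lhs = {lhs:.3f}, rhs = {rhs:.3f}, inequality holds: {lhs <= rhs}")
```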

Note: this can also be proved using Jensen's inequality.
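For completeness, here is a one-line sketch of the Jensen's inequality route: since $\log$ is concave and the weights $\frac{d_i}{d}$ sum to $1$, $$\sum_{i=1}^{u} \frac{d_i}{d} \log(\frac{d}{d_i}) \leq \log(\sum_{i=1}^{u} \frac{d_i}{d} \cdot \frac{d}{d_i}) = \log(u)$$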

VC-Dimension of More Complex Networks

The next theorem essentially says that if the number of parameters is finite, then the VC-dimension is finite as well, which is a powerful statement.

Theorem 2:

Let $h: \mathbb{R}^{w} \times \mathbb{R}^{n} \to \{-1,1\}$ represent a parametrized neural network architecture with $w$ weights and $n$ inputs, and let the hypothesis class be $H = \{x \mapsto h(w,x) : w \in \mathbb{R}^w \}$. Let the allowable operations for computing $h$ be only

  • Arithmetic: $+, -, \times, \div$,
  • Comparisons: $>, \geq, <, \leq, =, \neq$,
  • Outputting $-1$ or $1$.

If $h$ can be computed by an algorithm $A$ in $t$ such operations, then VC-dim($H$) $\leq 4w(t+2)$. (Wow, so despite the possibly very complex structure, we can still bound its VC-dimension!)
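To illustrate what counts as an allowable computation, here is a hypothetical toy example (not from the lecture) of an $h$ that uses only multiplication, addition, one comparison, and a $\pm 1$ output, so its operation count $t$ can be read off directly.

```python
def h(params, x):
    """A toy parametrized predictor using only arithmetic, a comparison, and a +/-1 output.

    params: list of w real weights; x: list of n real inputs (here w == n, no bias term).
    """
    s = 0.0
    for w_i, x_i in zip(params, x):
        s = s + w_i * x_i            # one multiplication and one addition per coordinate
    return 1 if s >= 0.0 else -1     # one comparison, then output +1 or -1

# With w = n weights this uses t = 2n + 1 operations, so Theorem 2 gives
# VC-dim(H) <= 4w(t + 2) = 4n(2n + 3): loose, but finite.
print(h([0.5, -1.0, 2.0], [1.0, 1.0, 1.0]))  # prints 1
```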

This result will not be proven here; the proof is in Anthony and Bartlett's Neural Network Learning: Theoretical Foundations (1999).

The class above includes all neural networks that operate using the listed computations, so no sigmoids or trigonometric functions. This is because they would require Taylor series to compute exactly, which would take an infinite number of operations!

Types of networks excluded from this theorem:

  • networks with sigmoid, tanh, etc. activations
  • RNNs (because of the DAG assumption)

Note: This is true only when considering exact real-valued computation. With floating-point numbers of finite precision, the hypothesis class is finite (however large), so the VC-dimension is bounded by $\log_2 (|H|)$. Thus, with $w$ weights and bit-precision $b$, VC-dim($H$) = $O(wb)$.

Corollary 1:

For a linear threshold network, the time needed for a forward pass is $t = O(w)$. This value is technically $O(w + u)$, but it is $O(w)$ since $w \geq u$ (every computation node has at least its bias as a parameter).

Plugging this into Theorem 2 gives VC-dim($H$) $\leq 4w(t+2) = O(w^2)$.

Note: Our analysis here is rough, and this bound is worse than the $O(w \log u)$ bound we got before. But look at this neat generalization!

Corollary 2:

If $H$ is the hypothesis class of a network with a piecewise polynomial activation having $\leq p$ pieces and degree $\leq l$, then VC-dim($H$) = $O(w(w + ul \log p))$. Thus, if $l$ and $p$ are fixed, VC-dim($H$) = $O(w^2)$.
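For instance (using ReLU purely as an illustration), the activation $\Phi(z) = \max(0, z)$ is piecewise linear with $p = 2$ pieces and degree $l = 1$, so the corollary gives $$\text{VC-dim}(H) = O(w(w + u \cdot 1 \cdot \log 2)) = O(w(w + u)) = O(w^2)$$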

As you can see, what was thought of as a more complex class is actually not that much more complicated than a linear threshold network.

Proof of Corollary 2:

We have $t = O(w + \text{time needed to compute }\Phi\text{ at each node})$, after which we can apply Theorem 2. Thus, we first need to find the time necessary to compute $\Phi$. In order to compute $\Phi$ at a node, we must:

  1. Find the correct piece, and
  2. Evaluate the polynomial on that piece.

For step 1, we can binary search over the pieces since their breakpoints are ordered along the real line, so finding the correct piece takes $O(\log p)$ time.

For step 2, a polynomial of degree $\leq l$ can be evaluated in $O(l)$ operations, e.g., using Horner's method.

Since both steps must be done for each of the $u$ computation nodes, the total time to compute the activations is $O(ul \log p)$, so $t = O(w + ul \log p)$. Plugging this into Theorem 2 gives VC-dim($H$) $\leq 4w(t+2) = O(w(w + ul \log p))$, as claimed.
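A minimal Python sketch of this evaluation scheme; representing the activation as a sorted list of breakpoints plus one coefficient list per piece is an assumption made for illustration.

```python
import bisect

def piecewise_poly_activation(breakpoints, coeffs, z):
    """Evaluate a piecewise polynomial activation at z.

    breakpoints: sorted list of p - 1 breakpoints splitting the real line into p pieces.
    coeffs: list of p coefficient lists (highest degree first), one per piece.
    """
    # Step 1: binary search for the piece containing z -- O(log p).
    piece = bisect.bisect_right(breakpoints, z)
    # Step 2: Horner's method evaluates the degree-l polynomial in O(l) operations.
    result = 0.0
    for c in coeffs[piece]:
        result = result * z + c
    return result

# Example: ReLU as a piecewise linear activation (p = 2 pieces, degree l = 1).
relu_breaks = [0.0]
relu_coeffs = [[0.0, 0.0],   # z < 0:  0*z + 0
               [1.0, 0.0]]   # z >= 0: 1*z + 0
print(piecewise_poly_activation(relu_breaks, relu_coeffs, -2.0))  # 0.0
print(piecewise_poly_activation(relu_breaks, relu_coeffs, 3.0))   # 3.0
```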
