Thm: For any $m \in \mathbb{N}$ and $\delta > 0$ there exists a constant $C>0$ such that with probability greater than $1-\delta$ over $S\sim D^m$ it holds that for all $h\in H$: $$L_D(h) \leq L_S(h) + C\sqrt{\frac{\mathop{VC-dim}(H) + \log(1/\delta)}{m}}$$
So what really is a Neural Network (NN)? Simply speaking, it is a directed acyclic graph (DAG):
Every edge $e$ has a weight $w_e \in \mathbb{R}$ and every node $r$ has a bias $\theta_r \in \mathbb{R}$. If a node $r$ receives inputs $z_1,\dots,z_k$ along edges $e_1,\dots,e_k$, then the output of $r$ is $$\phi\left(\sum_{i=1}^k w_{e_i} z_i + \theta_r\right)$$ for some activation function $\phi : \mathbb{R} \to \mathbb{R}$.
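In code, the computation at a single node can be sketched as follows (a minimal illustration of the formula above; the function name and representation are our own, not from the notes):

```python
import math

def node_output(weights, bias, inputs, phi=lambda x: 1.0 / (1.0 + math.exp(-x))):
    """Output of one node r: phi(sum_i w_{e_i} * z_i + theta_r).

    `weights` are the weights w_e of the incoming edges, `bias` is theta_r,
    and `inputs` are the values z_1, ..., z_k fed into r. The default
    activation phi is the sigmoid; any function R -> R can be passed in.
    """
    pre_activation = sum(w * z for w, z in zip(weights, inputs)) + bias
    return phi(pre_activation)

# A node with two incoming edges, using the sign activation:
sign = lambda x: 1.0 if x >= 0 else -1.0
out = node_output([0.5, -1.0], bias=0.25, inputs=[1.0, 1.0], phi=sign)  # -> -1.0
```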
For example:
Before moving on, we would like to fix some basic definitions for NNs.
$U=$ the number of computation nodes (excluding inputs) ($U=7$ in our example above)
$W=$ the number of parameters (weights and biases) ($W = 21$ in our example above)
$D=$ the depth, i.e. the number of layers (excluding the input layer) ($D = 3$ in our example above)
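These counts are easy to compute for a fully-connected feed-forward network. The sketch below is our own illustration; the layer sizes `[1, 4, 2, 1]` are merely one architecture consistent with $U=7$, $W=21$, $D=3$ under a full-connectivity assumption, not necessarily the network in the figure:

```python
def count_network(sizes):
    """Count U (computation nodes), W (weights + biases), and depth D for a
    fully-connected feed-forward net with layer sizes `sizes`, where
    sizes[0] is the input dimension and the rest are computation layers."""
    U = sum(sizes[1:])                                      # nodes, inputs excluded
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))  # one weight per edge
    biases = U                                              # one bias per node
    D = len(sizes) - 1                                      # layers, input excluded
    return U, weights + biases, D

# One (hypothetical) architecture matching the counts quoted above:
U, W, D = count_network([1, 4, 2, 1])  # -> (7, 21, 3)
```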
We take $H$ to be the set of functions that can be computed by this model. It turns out that, defined this way, the $\mathop{VC-dim}$ depends mostly on the activation function $\phi$. The following table summarizes some known results:
| $\phi$ | $\mathcal{O}(\mathop{VC-dim}(H))$ |
| --- | --- |
| Piecewise constant (e.g. $\mathop{sign}(x)$) | $W \log U$ |
| Piecewise linear (e.g. ReLU) | $WU$, $WD\log U$ |
| Sigmoid | $(WU)^2$ |
| General | $\infty$ |
The last bound can be achieved with the activation function $\phi(x) = \mathop{sigmoid}(x) + c \sin x$. Notice that this suggests the VC dimension is not very good at explaining the complexity of NNs in practice, since $W$ is usually very large. One of the main drawbacks is that this is a static notion of complexity: once we include optimization (usually via gradient descent or stochastic gradient descent), we restrict ourselves to a subset of the possible classifiers.
Another important thing to notice is that a bigger VC dimension increases the range of functions you can represent, but it also makes finding the right function more difficult. So there is a tradeoff between increasing and decreasing your VC dimension; this is essentially the bias-variance tradeoff.
Bounds for binary-valued activation networks
Thm: Assume that each activation function computes a binary-valued function. Let $\mathcal{H}_i$ be the class of hypotheses computed by node $u_i$.
Denote $$d_i = \mathop{VC-dim}(\mathcal{H}_i)$$
$$d = \sum_{i=1}^U d_i$$
Then for $m \geq \max_i d_i$, $$\Pi_{\mathcal{H}} (m) \leq \prod_{i=1}^U \Pi_{\mathcal{H}_i} (m) \leq \left(\frac{emU}{d}\right)^d.$$
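As a rough empirical sanity check of the theorem (our own construction, not from the notes), we can sample random weight settings of a tiny linear-threshold network and confirm that the number of distinct labelings of $m$ points found never exceeds $(emU/d)^d$. Random search only undercounts $\Pi_{\mathcal{H}}(m)$, so this cannot certify the bound, merely fail to refute it:

```python
import math
import random

random.seed(0)

def threshold(x):
    return 1 if x >= 0 else 0

def tiny_net(params, point):
    """2-2-1 linear-threshold network: two inputs, two hidden nodes, one output.
    `params` packs 9 numbers: a weight pair and bias for each of 3 nodes."""
    w1a, w1b, b1, w2a, w2b, b2, v1, v2, c = params
    h1 = threshold(w1a * point[0] + w1b * point[1] + b1)
    h2 = threshold(w2a * point[0] + w2b * point[1] + b2)
    return threshold(v1 * h1 + v2 * h2 + c)

m, U = 5, 3
d = 9  # each node thresholds k inputs, so d_i = deg(u_i) + 1 = 3; d = 3 * 3
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(m)]

labelings = set()
for _ in range(20000):  # random search over parameter space (an undercount)
    params = [random.gauss(0, 1) for _ in range(9)]
    labelings.add(tuple(tiny_net(params, p) for p in points))

bound = (math.e * m * U / d) ** d
assert len(labelings) <= bound  # consistent with Pi_H(m) <= (emU/d)^d
```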
Cor: Suppose all activations are linear threshold functions (e.g. $\mathop{sign}(w^T x)$). Then for $m \geq W$, $$\Pi_{\mathcal{H}}(m) \leq \left(\frac{emU}{W}\right)^W.$$
Recall that if $\mathcal{G}$ is a class of linear thresholds, i.e. hyperplanes in $\mathbb{R}^k$, then $\mathop{VC-dim} (\mathcal{G}) = k+1$. Let $\deg(u)$ denote the in-degree of node $u$. Then $d_i = \deg(u_i) + 1$, so $$d = \sum_{i=1}^U \left(\deg(u_i) + 1\right) = \sum_{i=1}^U \deg(u_i) + U = W,$$ since every edge carries one weight and every node has one bias. Plugging $d = W$ into the theorem gives the claimed bound.
We first want to show that $$\Pi_{\mathcal{H}}(m) \leq \prod_{i=1}^U \Pi_{\mathcal{H}_i}(m).$$ Before doing this, recall:
Definition (Topological Ordering): Given a DAG $G$, a topological ordering of $G$ is an ordering $u_1, u_2, \dots, u_U$ of the nodes such that each $u_i$ receives inputs only from input nodes and from nodes $u_j$ with $1 \leq j < i$.
Since a feed-forward neural network is acyclic, a topological ordering of its nodes is guaranteed to exist. Let $u_1, u_2, \dots, u_U$ be a topological ordering of the nodes in the neural network.
The key idea is that the output of $u_i$ depends only on:
the input nodes,
the outputs of $u_j$ with $1 \leq j < i$, and
the function computed at $u_i$.
Therefore, since each node depends only on previously computed quantities, we can perform induction on $i$. By definition, there are at most $\Pi_{\mathcal{H}_1}(m)$ ways in which $u_1$ could classify $S$. (Notice that here we are using the fact that the functions are binary classifiers.)
Similarly, $u_2$ may receive inputs only from the input nodes and from $u_1$. If those are fixed, there are at most $\Pi_{\mathcal{H}_2}(m)$ ways that $u_2$ could classify $S$. Therefore the number of ways the network up to $u_2$ can classify $S$ is bounded by $\Pi_{\mathcal{H}_1}(m) \cdot \Pi_{\mathcal{H}_2}(m)$, as we proved in one of the homeworks.
By induction, $$\Pi_{\mathcal{H}}(m) \leq \prod_{i=1}^U \Pi_{\mathcal{H}_i}(m)$$
For linear-threshold activations, combining this with the second half of the inequality yields $\mathop{VC-dim}(\mathcal{H}) = \mathcal{O}(W\log U)$.
Then we want to prove the second half of the inequality, namely $$\prod_{i=1}^U \Pi_{\mathcal{H}_i} (m) \leq \left(\frac{emU}{d}\right)^d$$
Since $m \geq \max_i d_i$, we can apply the Sauer–Shelah lemma to each term and obtain $$\prod_{i=1}^U \Pi_{\mathcal{H}_i} (m) \leq \prod_{i=1}^U \left(\frac{em}{d_i}\right)^{d_i}.$$
Since the product is difficult to work with, we take $\log$ and equivalently we want to bound $$\sum_{i=1}^U d_i \log \left(\frac{em}{d_i}\right)$$
Writing $p_i = d_i/d$, so that $\sum_{i=1}^U p_i = 1$, concavity of $\log$ (Jensen's inequality) gives $$\sum_{i=1}^U d_i \log\left(\frac{em}{d_i}\right) = d\sum_{i=1}^U p_i \log\left(\frac{em}{d\,p_i}\right) \leq d\log\left(\sum_{i=1}^U p_i \cdot \frac{em}{d\,p_i}\right) = d\log\left(\frac{emU}{d}\right),$$ which, after exponentiating, is exactly the desired bound $\left(\frac{emU}{d}\right)^d$.
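As a quick numeric check of this step (our own illustration), one can verify the inequality $\sum_i d_i \log(em/d_i) \leq d \log(emU/d)$ on arbitrary values of $d_i$:

```python
import math

def lhs(ds, m):
    """sum_i d_i * log(e * m / d_i) for per-node VC dimensions `ds`."""
    return sum(d_i * math.log(math.e * m / d_i) for d_i in ds)

def rhs(ds, m):
    """d * log(e * m * U / d), where d = sum_i d_i and U = len(ds)."""
    d, U = sum(ds), len(ds)
    return d * math.log(math.e * m * U / d)

# Arbitrary per-node VC dimensions d_i, with m >= max_i d_i:
ds, m = [3, 3, 3, 5, 7], 10
assert lhs(ds, m) <= rhs(ds, m)  # the Jensen step holds on this example
```

Note that the inequality is tight exactly when all $d_i$ are equal, since Jensen's inequality is an equality for a constant argument.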