CS7545_Sp23_Lecture_25: Introduction to Complexity of Neural Networks

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler LaBonte

Notes for Lecture 25

April 18, 2023

Scribes: Kevin Rojas, Xuhan Zhao

Recap of key concepts in VC-dimension

We start by doing a small recap of some definitions and results that we've studied before.

Recall that if $H \subseteq \{ h : X \to \{-1,1\} \}$ is a set of classifiers and $S = \{x_i\}_{i=1}^m \subset X$ is a set of samples, then we can define:

$$ H |_S = \{ (h(x_1), \dots, h(x_m)) : h \in H \} $$

Definition (Growth Function) We define the growth function as:

$$ \Pi_H(m) = \max\limits_{|S| = m, S \subset X} | H|_S | $$

Definition (Shattering) We say that a set $S \subset X$ with $|S| = m$ is shattered by $H$ if and only if:

$$ |H|_S| = 2^m $$

Equivalently, some set of size $m$ is shattered if and only if $\Pi_H(m) = 2^m$.

Definition (VC-Dimension) We define the VC dimension as:

$$ \mathop{VC-dim}(H) = \max \{ m : \Pi_H(m) = 2^m \} $$
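To make these definitions concrete, here is a small sketch (ours, not from the lecture) that brute-forces $\Pi_H(m)$ and checks shattering for the toy class of one-dimensional threshold classifiers $h_t(x) = \mathop{sign}(x - t)$ over a finite grid of thresholds; the particular domain and thresholds are arbitrary choices for illustration.

```python
import itertools

# Toy hypothesis class: 1-D thresholds h_t(x) = +1 if x >= t else -1,
# with thresholds restricted to a finite grid so H is finite.
thresholds = [0.5, 1.5, 2.5, 3.5]
H = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]

def restriction(H, S):
    """All distinct labelings (h(x_1), ..., h(x_m)) realized by H on S."""
    return {tuple(h(x) for x in S) for h in H}

def growth_function(H, X, m):
    """Pi_H(m): maximize |H|_S| over all subsets S of X with |S| = m."""
    return max(len(restriction(H, S)) for S in itertools.combinations(X, m))

X = [0, 1, 2, 3, 4]  # finite domain
for m in range(1, len(X) + 1):
    pi = growth_function(H, X, m)
    print(f"m={m}: Pi_H(m)={pi}, shatters a set of size m: {pi == 2 ** m}")
# VC-dim = largest m with Pi_H(m) = 2^m (here it is 1, as expected for thresholds).
```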

Thm (Sauer-Shelah Lemma) If $d = \mathop{VC-dim}(H)$ then for all $m\geq d$ :

$$ \Pi_H(m) \leq \left( \frac{em}{d} \right)^d = \mathcal{O}(m^d) $$
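As a quick numerical sanity check (ours, not from the lecture), the sketch below compares the exact Sauer-Shelah count $\sum_{i=0}^d \binom{m}{i}$ with the $(em/d)^d$ bound quoted above and with $2^m$.

```python
from math import comb, e

def sauer_exact(m, d):
    """Exact Sauer-Shelah count: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

def sauer_upper(m, d):
    """The (em/d)^d upper bound used in the notes."""
    return (e * m / d) ** d

d = 3
for m in [3, 5, 10, 50, 100]:
    print(m, sauer_exact(m, d), round(sauer_upper(m, d), 1), 2 ** m)
# For m >= d the polynomial bound (em/d)^d eventually falls far below 2^m,
# which is what makes the generalization bound below non-vacuous.
```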

Thm Let $d = \mathop{VC-dim}(H)$. For any $m \in \mathbb{N}$ and $\delta > 0$ there exists a constant $C>0$ such that, with probability at least $1-\delta$ over $S\sim D^m$, it holds that for all $h\in H$:

$$L(h) \leq \hat L_S (h) + C \sqrt{\frac{d}{m}} + \sqrt{\frac{\log 1/\delta}{2m}} $$
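For a rough sense of scale, the following sketch (ours) evaluates the right-hand side of the bound for some concrete values; the theorem does not pin down $C$, so we set $C = 1$ purely for illustration.

```python
from math import sqrt, log

def vc_generalization_gap(d, m, delta, C=1.0):
    """Evaluate C*sqrt(d/m) + sqrt(log(1/delta)/(2m)); C = 1 is an arbitrary choice."""
    return C * sqrt(d / m) + sqrt(log(1 / delta) / (2 * m))

# Example: a class with VC dimension d = 100.
for m in [1_000, 10_000, 100_000]:
    print(m, round(vc_generalization_gap(d=100, m=m, delta=0.05), 4))
# The gap shrinks like sqrt(d/m): to make it small, m must be a large
# multiple of the VC dimension.
```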

Neural Networks

So what really is a neural network (NN)? Simply speaking, it is a directed acyclic graph (DAG):

(Figure: an example feedforward network drawn as a DAG; it is the network referred to below with $U = 7$, $W = 21$, and $D = 3$.)

Every edge $e$ has a weight $w_e \in \mathbb{R}$ and every node $r$ has a bias $\theta_r \in \mathbb{R}$. If a node $r$ receives inputs $z_1,\dots, z_k$ along its incoming edges $e = 1, \dots, k$, then the output of $r$ is:

$$\phi\left(\sum_{e = 1}^k w_e \cdot z_e - \theta_r \right)$$

for some activation function $\phi : \mathbb{R} \to \mathbb{R}$.

For example:

(Figure: a worked example of a node's output.)
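Since the figure is not reproduced here, a minimal sketch (ours, not from the lecture) of the computation performed at a single node, with a threshold activation by default:

```python
def node_output(weights, inputs, bias, phi=lambda s: 1.0 if s >= 0 else -1.0):
    """Compute phi(sum_e w_e * z_e - theta_r) for one node.
    The default phi is a sign-style threshold; any phi: R -> R can be passed in."""
    s = sum(w * z for w, z in zip(weights, inputs)) - bias
    return phi(s)

# A node with two incoming edges (weights, inputs, and bias are made-up values):
print(node_output(weights=[0.5, -1.0], inputs=[2.0, 1.0], bias=0.3))           # threshold output
print(node_output([0.5, -1.0], [2.0, 1.0], 0.3, phi=lambda s: max(0.0, s)))    # same node with ReLU
```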

Before moving on, we would like to establish some basic definitions for NNs.

  • $U=$ the number of computation nodes (excluding inputs) ($U=7$ in our example above)
  • $W=$ the number of parameters (weights and biases) ($W = 21$ in our example above)
  • $D=$ the depth, i.e. the number of layers (excluding the input layer) ($D = 3$ in our example above; see the counting sketch after this list)
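The original figure is not reproduced, so as a counting sketch (ours) we use one hypothetical fully connected architecture, 1 input followed by layers of sizes 4, 2, 1, which happens to be consistent with the numbers above; the layer sizes themselves are an assumption, not taken from the lecture.

```python
def network_counts(layer_sizes):
    """Count (U, W, D) for a fully connected feedforward network.
    layer_sizes[0] is the number of input nodes; the rest are computation layers.
    U = computation nodes, W = weights + biases, D = depth (input layer excluded)."""
    layers = layer_sizes[1:]
    U = sum(layers)                                                        # computation nodes
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    W = weights + U                                                        # one bias per computation node
    D = len(layers)
    return U, W, D

# Hypothetical architecture consistent with the example: 1 input, then layers 4, 2, 1.
print(network_counts([1, 4, 2, 1]))  # -> (7, 21, 3)
```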

We let $H$ be the set of functions that can be computed by this model. It turns out that, with this definition, $\mathop{VC-dim}(H)$ depends mostly on the activation function $\phi$. The following table summarizes some known results:

| $\phi$ | $\mathcal{O}(\mathop{VC-dim}(H))$ |
| --- | --- |
| Piecewise constant (e.g. $\mathop{sign}(x)$) | $W \log U$ |
| Piecewise linear (e.g. ReLU) | $WU$, $WD\log U$ |
| Sigmoid | $(WU)^2$ |
| General | $\infty$ |

The last row can be achieved with the activation function $\phi(x) = \mathop{sigmoid}(x) + c \sin x$. Notice that this suggests VC dimension is not very good at explaining the complexity of NNs in practice, since $W$ is usually very large. One of the main drawbacks is that this is a static notion of complexity: once we include optimization (usually via gradient descent or stochastic gradient descent), we restrict ourselves to a subset of the possible classifiers.
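To see why $W$ being large makes these bounds uninformative, here is a back-of-the-envelope sketch (ours); the layer sizes are an illustrative assumption and the big-$\mathcal{O}$ constants are simply dropped.

```python
from math import log2

# A modest fully connected ReLU network: 784 -> 512 -> 512 -> 10
# (sizes chosen for illustration only).
sizes = [784, 512, 512, 10]
U = sum(sizes[1:])                                          # computation nodes
W = sum(a * b for a, b in zip(sizes[:-1], sizes[1:])) + U   # weights + biases
D = len(sizes) - 1                                          # depth

print("W =", W, ", U =", U, ", D =", D)
print("W * D * log2(U) ~", int(W * D * log2(U)))   # the piecewise-linear row, constants dropped
# This is on the order of tens of millions, far larger than typical training set
# sizes, so the corresponding uniform bound on the generalization gap is vacuous in practice.
```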

Another important thing to notice is that a bigger VC dimension increases the range of functions you can represent, but it makes finding the right function harder. So there is a tradeoff between increasing and decreasing your VC dimension; this is essentially the bias-variance tradeoff.

Bounds for binary-valued activation networks

Thm: Assume that each activation function computes a binary-valued function. Let $\mathcal{H}_i$ be the class of hypotheses computed by node $u_i$.

Denote $$d_i = \mathop{VC-dim}(\mathcal{H}_i)$$

$$d = \sum_{i=1}^U d_i$$

Then for $m \geq \max_i d_i$, $$\Pi_{\mathcal{H}} (m) \leq \prod_{i=1}^U \Pi_{\mathcal{H}_i} (m) \leq \left(\frac{emU}{d}\right)^d$$

Cor: Suppose all activations are linear thresholds (e.g. $\mathop{sign}(w^T x)$). Then for $m \geq W$,

$$\Pi_{\mathcal{H}}(m) \leq \left(\frac{emU}{W}\right)^W$$

which implies that:

$$\mathop{VC-dim}(\mathcal{H}) \leq 2W\log_2 \left(\frac{2eU}{\log 2}\right) = \mathcal{O}(W\log U)$$
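Plugging in the example network from before ($W = 21$, $U = 7$) gives a concrete value for this bound; the calculation below is ours, just to illustrate the formula.

```python
from math import e, log, log2

W, U = 21, 7   # the example network from earlier in these notes
bound = 2 * W * log2(2 * e * U / log(2))
print(round(bound, 1))   # about 243, i.e. on the order of W log U as claimed
```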

Proof of corollary from the theorem:

Recall if $\mathcal{G}$ is a class of linear thresholds, i.e. hyperplanes in $\mathbb{R}^k$, then $\mathop{VC-dim} (\mathcal{G}) = k+1$. Let $deg(u)$ denote the in-degree of node $u$. Then

$$\begin{align*} d &= \sum_{i=1}^U \mathop{VC-dim} (\mathcal{H}_i) \\ &\leq \sum_{i=1}^U (deg(u_i) + 1) \\ &= \sum_{i=1}^U deg(u_i) + \sum_{i=1}^U 1 \\ &= W \end{align*}$$

As we wanted.

Proof of theorem:

  1. We first want to show $$\Pi_{\mathcal{H}}(m) \leq \prod_{i=1}^U \Pi_{\mathcal{H}_i}(m)$$ Before doing this, recall the following definition:

Definition (Topological Ordering) Given a DAG $G$, a topological ordering of $G$ is an ordering $u_1, u_2, \dots, u_U$ of the nodes such that each $u_i$ receives inputs only from input nodes and from nodes $u_j$ with $1 \leq j < i$.

Since a feedforward neural network is acyclic, a topological ordering of its nodes is guaranteed to exist. Let $u_1, u_2, \dots, u_U$ be a topological ordering of the computation nodes of the network.

The key idea is that the output of $u_i$ depends only on:

  1. Input nodes
  2. The outputs of $u_j$, with $1 \leq j < i$
  3. The function computed at $u_i$

Therefore, since each node depends only on previously computed quantities, we can perform induction on $i$. For the base case, there are at most $\Pi_{\mathcal{H}_1}(m)$ ways in which $u_1$ can classify $S$. (Notice that here we are using the fact that each node computes a binary-valued function.)

Similarly, $u_2$ may receive inputs only from the input nodes and from $u_1$. Once these are fixed, there are at most $\Pi_{\mathcal{H}_2}(m)$ ways that $u_2$ can classify $S$. Therefore, the number of ways the network up to $u_2$ can classify $S$ is bounded by $\Pi_{\mathcal{H}_1}(m) \cdot \Pi_{\mathcal{H}_2}(m)$, as we proved in one of the homeworks.

By induction, $$\Pi_{\mathcal{H}}(m) \leq \prod_{i=1}^U \Pi_{\mathcal{H}_i}(m)$$

which, combined with the second inequality (proved next) and the bound $d \leq W$ from the corollary, is what gives $\mathop{VC-dim}(\mathcal{H}) = \mathcal{O}(W\log U)$ in the linear threshold case.

  2. Then we want to prove the second half of the inequality, namely $$\prod_{i=1}^U \Pi_{\mathcal{H}_i} (m) \leq \left(\frac{emU}{d}\right)^d$$

Since $m \geq \max_i d_i$, we can apply the Sauer-Shelah Lemma to each of the terms and obtain $$\prod_{i=1}^U \Pi_{\mathcal{H}_i} (m) \leq \prod_{i=1}^U \left(\frac{em}{d_i}\right)^{d_i} $$

Since the product is difficult to work with directly, we take the $\log$; equivalently, we want to bound $$\sum_{i=1}^U d_i \log \left(\frac{em}{d_i}\right)$$

We now normalize, using the fact that the ratios $d_i/d$ sum to one: $$\sum_{i=1}^U \frac{d_i}{d} = 1$$

We then notice that:

$$\begin{align*} &\frac{1}{d} \sum_{i=1}^U d_i \log \left(\frac{em}{d_i}\right) + \log \left(\frac{d}{em} \right) \\ &= \sum_{i=1}^U \frac{d_i}{d} \log \left(\frac{em}{d_i}\right) + \sum_{i=1}^U \frac{d_i}{d} \log \left(\frac{d}{em} \right) \\ &= \sum_{i=1}^U \frac{d_i}{d} \log \left(\frac{d}{d_i}\right) \end{align*}$$

So we ended up bounding this by an entropy term, which is easily bounded (it is at most $\log U$). We will finish this next class.
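For reference, a sketch (ours, since the lecture defers this step to next class) of how the entropy term is typically handled: $\sum_{i=1}^U \frac{d_i}{d}\log\frac{d}{d_i}$ is the entropy of the distribution $(d_1/d, \dots, d_U/d)$ over the $U$ nodes, so by Jensen's inequality (concavity of $\log$):

$$\sum_{i=1}^U \frac{d_i}{d} \log \left(\frac{d}{d_i}\right) \leq \log \left( \sum_{i=1}^U \frac{d_i}{d} \cdot \frac{d}{d_i} \right) = \log U$$

Combining this with the identity above gives $\frac{1}{d}\sum_{i=1}^U d_i \log\left(\frac{em}{d_i}\right) \leq \log\left(\frac{emU}{d}\right)$, and exponentiating recovers $\prod_{i=1}^U \Pi_{\mathcal{H}_i}(m) \leq \left(\frac{emU}{d}\right)^d$.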
