CS7545_Sp23_Lecture_27: Generalization Bounds are Vacuous in Deep Learning

CS 7545: Machine Learning Theory -- Spring 2023

Instructor: Tyler LaBonte

Notes for Lecture 27

April 25, 2023

Scribes: Zhaoyu Xu, Mian Wu

Nonvacuous Generalization Bound for Neural Networks

A few reasons why it is important to study generalization bounds and complexity measures:

  • Insight into and prediction of model behavior (explaining surprising phenomena such as double descent, weaker performance under distributed training, etc.)
  • Algorithms and regularization techniques
    1. Provable guarantees on our algorithms, even before training or testing in real-life settings (for instance, models for airplanes or nuclear reactors)
    2. Using a complexity measure as a regularizer (some of the oldest ML models directly minimize quantities like those appearing in Rademacher generalization bounds; if we can find the right complexity measure for NNs, then theoretically we should be able to build better-performing training procedures)
  • A more rigorous mindset in industry

Review of Classical Complexity Measures for Neural Networks

Let $\mathcal{F}$ be the real-valued function class of a feedforward neural network with ReLU activation $\phi$, $W$ the number of parameters, $U$ the number of computation nodes, $D$ the depth, and $m$ the number of data points. Then: $$\mathcal{H}=\text{sgn} \circ \mathcal{F}$$ $$f(x)=w_D \cdot \phi\left(w_{D-1} \cdot \phi(\cdots w_1 x)\right)$$ And we have previously seen that:

$$ VC(\mathcal{H})=O(W D \log W)$$

$$ R(\mathcal{F})=O\left(\frac{1}{\sqrt{m}} 2^D \prod_{i=1}^D ||w_i||_F \right) $$
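For concreteness, here is a minimal NumPy sketch of such a feedforward ReLU network and the quantities $W$, $U$, $D$ (the depth, input dimension, and layer widths are made up purely for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Illustrative depth-3 ReLU network f(x) = w_3 · phi(w_2 · phi(w_1 x));
# the input dimension and hidden widths are made up for this example.
rng = np.random.default_rng(0)
d, h1, h2 = 10, 64, 64
w1 = rng.normal(size=(h1, d))
w2 = rng.normal(size=(h2, h1))
w3 = rng.normal(size=(1, h2))

def f(x):
    return w3 @ relu(w2 @ relu(w1 @ x))

W = sum(w.size for w in (w1, w2, w3))  # number of parameters W
U = h1 + h2 + 1                        # computation nodes U (hidden + output units)
D = 3                                  # depth D (number of weight matrices)

x = rng.normal(size=d)
print(f"f(x) = {f(x).item():+.3f},  W = {W},  U = {U},  D = {D}")
```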

However, both bounds are insufficient for modern neural networks, for the following reasons:

  • For an overparameterized network (i.e., $W \gg m$), the VC-dimension complexity term $\sqrt{\frac{VC(\mathcal{H})}{m}} \gg 1$.

  • Modern NNs can be very deep, and the Rademacher complexity bound is exponential in $D$, hence $\gg 1$.

  • Gradient descent tends to increase $||w_i||$, so $\prod_i ||w_i||_F$ can become large. One remedy is early stopping $\rightarrow$ terminate training before SGD converges.

Emphasis: these bounds are not wrong; they are just too loose for practical models, as the rough numerical sketch below illustrates.
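The sketch uses made-up (but plausible) sizes for a modern network, and the hidden constants in the $O(\cdot)$ notation are dropped, so the numbers are only indicative:

```python
import numpy as np

# Made-up sizes for a modern overparameterized network and dataset.
W = 10_000_000                 # number of parameters
D = 50                         # depth
m = 50_000                     # number of training examples
frob_norms = np.full(D, 2.0)   # assume each ||w_i||_F ≈ 2 (illustrative value)

# VC-based complexity term: sqrt(VC(H)/m) with VC(H) ≈ W * D * log(W).
vc_term = np.sqrt(W * D * np.log(W) / m)

# Rademacher-based term: (1/sqrt(m)) * 2^D * prod_i ||w_i||_F.
rad_term = (2.0 ** D) * np.prod(frob_norms) / np.sqrt(m)

# Any bound on the 0-1 generalization gap that exceeds 1 is vacuous.
print(f"sqrt(VC/m)                      ≈ {vc_term:.1e}")   # on the order of 1e2
print(f"2^D * prod ||w_i||_F / sqrt(m)  ≈ {rad_term:.1e}")  # astronomically large
```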

Regularization on Neural Networks

Example: the Support Vector Machine (SVM), which solves $\min \frac{1}{2}||w||_2^2$ subject to fitting all the data.

(Figure: classification of data by a support vector machine (SVM).)
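As a concrete instance of that objective, here is a minimal sketch that solves the hard-margin QP directly with cvxpy (the toy data are made up and assumed linearly separable; no bias term is used, matching the linear class below):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])

# Hard-margin SVM: minimize (1/2)||w||_2^2  s.t.  y_i * w^T x_i >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value)
```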

We showed that the Rademacher complexity of

$$\mathcal{F}= \left\{x \mapsto w^{\top} x:\|w\|_2 \leq B\right\}$$

is

$$ R(\mathcal{F})=O\left( \frac{B}{\sqrt m}\right) $$
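As a reminder, the standard derivation (assuming each data point satisfies $\|x_i\|_2 \leq C$ for some constant $C$):

$$ R(\mathcal{F}) = \frac{1}{m}\mathbb{E}_{\sigma}\left[\sup_{\|w\|_2 \leq B} \sum_{i=1}^m \sigma_i\, w^{\top} x_i\right] = \frac{B}{m}\,\mathbb{E}_{\sigma}\left\|\sum_{i=1}^m \sigma_i x_i\right\|_2 \leq \frac{B}{m}\sqrt{\sum_{i=1}^m \|x_i\|_2^2} \leq \frac{BC}{\sqrt{m}} $$

Here the second equality is Cauchy-Schwarz (the supremum is attained at $w$ proportional to $\sum_i \sigma_i x_i$), and the first inequality combines Jensen's inequality with $\mathbb{E}[\sigma_i \sigma_j] = 0$ for $i \neq j$.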

However, it seems that overparameterized Neural Networks may not need regularization to perform well.

Example: although adding regularization can improve performance, NNs still obtain acceptable results without it (Zhang et al., 2017).

  • CIFAR-10 dataset, Inception: 86.03% accuracy with weight decay (equivalent to an $\ell_2$ regularizer on the weights; see the sketch after this list), 85.75% accuracy without weight decay

  • ImageNet dataset, Inception: 67.18% accuracy with weight decay, 59.80% accuracy without weight decay
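For reference, "weight decay" is simply the coefficient of that $\ell_2$ penalty as passed to the optimizer. A minimal PyTorch sketch (the model and hyperparameters here are illustrative, not the Inception setup of Zhang et al.):

```python
import torch.nn as nn
import torch.optim as optim

# Illustrative model; not the Inception architecture used in the experiments.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512),
                      nn.ReLU(), nn.Linear(512, 10))

# weight_decay adds weight_decay * w to each parameter's gradient, which for
# SGD is equivalent to adding (weight_decay / 2) * ||w||_2^2 to the loss.
opt_with_wd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                        weight_decay=5e-4)

# Setting weight_decay=0 removes the regularizer entirely.
opt_without_wd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                           weight_decay=0.0)
```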

Neural network performance depends heavily on optimization.

  • SVM (the following methods all obtain the same solution)
    • Solve the QP exactly
    • Gradient descent
    • Coordinate descent
  • NNs trained using gradient descent
    • No unique solution
    • The way you train actually matters
    • GD finds a low-complexity solution all by itself (implicit regularization; see the sketch after this list)
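Implicit regularization is easiest to see in a simpler, analyzable setting. The sketch below (underdetermined linear regression, not a neural network) shows that gradient descent initialized at zero converges to the minimum-$\ell_2$-norm interpolating solution without any explicit penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 100                          # overparameterized: d >> m
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

# Gradient descent on (1/2)||Xw - y||^2, initialized at zero.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2    # step size 1 / sigma_max(X)^2
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-l2-norm solution among all interpolants is pinv(X) @ y.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual         :", np.linalg.norm(X @ w - y))       # ~ 0
print("distance to min-norm soln :", np.linalg.norm(w - w_min_norm))  # ~ 0
```

The reason: starting from zero, every gradient step lies in the row space of $X$, and the only interpolating solution in that subspace is the minimum-norm one.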

The Double Descent Phenomenon

(Figure: Classical Bias-Variance Tradeoff Curve.)

(Figure: Double Descent Risk Curve (Belkin et al., 2019).)

Note: this naturally leaves us with a question for further investigation. From the "modern" perspective, what should be on the x-axis? That is, what would be an appropriate quantity with which to reason about the generalization of (deep) neural networks?
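The phenomenon itself can be reproduced in a small simulation in the spirit of Belkin et al. (2019), though the setup below is made up for illustration and is not their exact experiment: minimum-norm least-squares regression on fixed random ReLU features. As the number of features $p$ grows, the test error typically spikes near the interpolation threshold $p \approx m$ and then decreases again (the exact curve depends on the seed and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
m, m_test, d = 40, 2000, 5              # small training set, low-dimensional inputs

def target(X):
    return np.sin(X @ np.ones(d))       # arbitrary smooth target function

X_tr = rng.normal(size=(m, d))
y_tr = target(X_tr) + 0.1 * rng.normal(size=m)   # noisy labels
X_te = rng.normal(size=(m_test, d))
y_te = target(X_te)

def features(X, V):
    # Fixed random first layer with ReLU activations; only the top layer is fit.
    return np.maximum(X @ V, 0.0)

for p in [5, 10, 20, 40, 60, 100, 400, 2000]:
    V = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi_tr, Phi_te = features(X_tr, V), features(X_te, V)
    w = np.linalg.pinv(Phi_tr) @ y_tr           # minimum-norm least-squares fit
    test_mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {test_mse:.3f}")
```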
