06 Solutions to Overfitting (Neural Nets)

Neural Nets Overfitting

Given the huge number of parameters (high model complexity) in a Neural Net, Neural Nets are more likely to run into the problem of overfitting. There are several techniques to mitigate overfitting in Neural Nets.

-L2 Regularization and Weight Decay

L2 Regularization in Neural Nets shares the same idea as Ridge Regression (L2 Regularization of Linear Regression): Overfit models often rely on excessively large weights. We penalize models that have large weights by adding the squared L2 Norm of all weights to the Loss function.

$$ L_{reg} = L_{data} + \lambda \left\Vert W \right\Vert_{2}^2 $$

where:

  • $L_{reg}$ is the regularized Loss.
  • $L_{data} = \dfrac{1}{n} \sum_{i=1}^{n} L(y_i, \widehat{y}_i)$ is the original Loss (e.g., MSE, Cross Entropy).
  • $\lambda$ is the strength of the regularization penalty.
  • $\left\Vert W \right\Vert_{2}^2 = \sum_l \sum_{i,j}\left(W_{ij}^{(l)}\right)^2$ is the sum of all squared weights in the network (Norm 2 squared).

(Note: Capital $W$ represents the entire matrix of weights, while lowercase $w$ represents a single specific weight parameter during backpropagation).

The original gradient (w/o regularization) for a single weight $w$ is:

$$ \dfrac{\partial L}{\partial w} = \dfrac{\partial L_{data}}{\partial w} $$

The regularized gradient applies the power rule to the penalty term, resulting in:

$$ \dfrac{\partial L_{reg}}{\partial w} = \dfrac{\partial L_{data}}{\partial w} + 2\lambda w $$

The weight update using the original gradient is:

$$ w \leftarrow w - \alpha \left( \dfrac{\partial L_{data}}{\partial w} \right) $$

The weight update using the L2 regularized gradient becomes:

$$ w \leftarrow w - \alpha \left( \dfrac{\partial L_{data}}{\partial w} + 2 \lambda w \right) \quad\Rightarrow\quad w \leftarrow (1 - 2 \alpha \lambda)\, w - \alpha \left( \dfrac{\partial L_{data}}{\partial w} \right) $$

Because the learning rate ($\alpha$) and regularization parameter ($\lambda$) are chosen to be small positive fractions (ensuring $2\alpha\lambda < 1$), the multiplier $(1 - 2\alpha\lambda)$ is strictly between 0 and 1. This means the weight experiences exponential decay, proportionally shrinking by a tiny fraction at each update step (e.g., at every mini-batch for SGD, or every epoch for Full-batch Gradient Descent).
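
As a concrete illustration, here is a minimal NumPy sketch of the decayed update derived above; the weight values, gradient values, and hyper-parameters are made-up assumptions, not taken from these notes.

```python
import numpy as np

# Illustrative (assumed) values: a small weight matrix and its data-loss gradient
w = np.array([[0.8, -1.2], [0.5, 2.0]])            # current weights W
grad_data = np.array([[0.1, -0.3], [0.2, 0.05]])   # dL_data/dW from backpropagation

alpha = 0.01   # learning rate
lam = 0.001    # L2 regularization strength (lambda)

# L2-regularized update: w <- (1 - 2*alpha*lambda) * w - alpha * dL_data/dw
w = (1 - 2 * alpha * lam) * w - alpha * grad_data
```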

Note on Adam vs AdamW: While L2 Regularization and Weight Decay result in the exact same mathematical update rule in standard Stochastic Gradient Descent (SGD), they behave differently in adaptive optimizers like Adam. Modern frameworks use a specific optimizer called AdamW to ensure true weight decay is applied independently of the loss gradient. AdamW explicitly separates the two concepts: it calculates the normal data gradients using Adam, and then applies the proportional Weight Decay directly to the weights at the very end of the step.
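
In PyTorch this distinction shows up as the `weight_decay` argument of the optimizers. A minimal sketch, assuming a toy linear model (the model and hyper-parameter values are illustrative, not from these notes):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # toy model (assumption, for illustration only)

# SGD with weight_decay: adds weight_decay * w to every gradient (the classic L2 penalty)
sgd = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# AdamW: decoupled weight decay, applied directly to the weights after the Adam step
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```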

-Dropout Regularization

Dropout is another highly effective regularization technique used to prevent overfitting in neural networks. Instead of modifying the loss function (like L2 Regularization), Dropout modifies the network architecture dynamically during training.

How Dropout Works During Training

At each training step (for every mini-batch), the network randomly "drops" a subset of neurons.

  1. For each neuron in a layer, drop the neuron with a probability of $p$.
  2. For a dropped neuron, set that neuron's activation to 0 (effectively removing it from the network for this step).
  3. Continue the forward and backward passes normally using only the remaining active (not dropped) neurons.

The new activation $\tilde{a}_j$ is the original activation $a_j$ multiplied by a mask $m_j$:

$$ \tilde{a}_j = a_j \cdot m_j \quad \text{where } m_j \sim \text{Bernoulli}(1 - p) $$

(Note: $1-p$ is the probability of keeping the neuron, often called the "keep probability").

  • Common dropout rate: $p = 0.2$ to $0.5$
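
A minimal NumPy sketch of the masking steps just described (classic, non-inverted dropout); the activation values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # dropout rate (probability of dropping a neuron)
a = np.array([0.7, 1.3, 0.2, 2.1, 0.9])   # activations of one layer (made-up values)

# Each neuron is kept with probability 1-p (Bernoulli mask) and dropped otherwise
m = rng.binomial(1, 1 - p, size=a.shape)
a_tilde = a * m                            # dropped neurons output 0 for this step
```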

Training vs. Inference (Testing)

Because neurons are dropped during training, the network's behavior must be adjusted during inference so that predictions remain scaled correctly.

During Training:

  • Neurons are dropped randomly with probability $p$.
  • The network sees a different "thinned" version of itself at each batch.
  • Because fewer neurons are passing signals forward, the network learns larger weights for the remaining active neurons to compensate. The active neurons effectively "work harder" than they would if all neurons were present.

During Inference (Testing):

  • Use ALL neurons (turn dropout off).
  • Standard Scaling: Because all neurons are now working (and they were trained to output stronger signals), the overall output will be artificially high. To compensate, all activations must be scaled down by $(1 - p)$.

The Statistical Reason (Expected Value)

Why scale by exactly $(1-p)$? Neural networks rely on the Expected Value ($\mathbb{E}$) of their activations remaining consistent between training and testing so that internal summations do not explode.

Let an original neuron's activation be $a$. During training, we apply a mask $m$ that is $1$ (with keep probability $1-p$) or $0$ (with drop probability $p$). The expected value of the activation during training is:

$$ \mathbb{E}[a_{\text{train}}] = (a \times P(\text{keeping})) + (0 \times P(\text{dropping})) $$

$$ \mathbb{E}[a_{\text{train}}] = a(1-p) + 0(p) = \mathbf{a(1-p)} $$

During inference, there is no randomness, so the output is simply $a$. To make the inference activation match the expected scale it learned during training, we must artificially apply that scaling factor:

$$ a_{\text{inference}} = \mathbf{a(1-p)} $$

The Modern Approach: "Inverted Dropout"

Instead of scaling at inference (which slows down prediction times), modern deep learning frameworks (like PyTorch and Keras) use Inverted Dropout:

  • During training, the activations of the kept neurons are artificially boosted by dividing them by $(1 - p)$.
  • Benefit: Absolutely no scaling or math changes are needed at inference time, making predictions mathematically consistent and computationally faster!
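
A minimal NumPy sketch contrasting the two conventions (the activation values and dropout rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
a = np.array([0.7, 1.3, 0.2, 2.1, 0.9])   # illustrative activations
m = rng.binomial(1, 1 - p, size=a.shape)  # keep mask

# Classic dropout: scale at inference time
a_train_classic = a * m
a_test_classic = a * (1 - p)               # every activation scaled down by (1 - p)

# Inverted dropout (PyTorch / Keras): scale at training time instead
a_train_inverted = a * m / (1 - p)         # kept neurons boosted by 1 / (1 - p)
a_test_inverted = a                        # inference needs no adjustment at all
```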

Why Dropout Works

There are two primary theories for why this randomized dropping prevents overfitting so effectively:

Theory 1: Prevents Co-adaptation

  • Because neurons are randomly turned off, they cannot rely on the outputs of specific neighboring neurons to fix their mistakes.
  • Each neuron is forced to learn independently useful features.
  • It forces the network to create redundant representations of the data, making the model highly robust.

Theory 2: Implicit Ensemble

  • Every time a unique set of neurons is dropped, the model becomes a slightly different "thinned" network.
  • Training across many epochs means the network is essentially training exponentially many different smaller networks simultaneously.
  • When all neurons are turned back on during inference, it acts like an ensemble average of all those smaller networks' predictions.

Dropout in Practice

Where to apply it:

  • Typically: After Fully Connected (Dense) layers.
  • Sometimes: After Convolutional layers (though less common, and usually requires specialized spatial dropout).
  • Never: On the Output layer (you don't want to randomly drop your final predictions!).

Typical Dropout Rates ($p$):

  • Fully Connected (Dense) layers: $0.5$ (Drops 50% of neurons)
  • Convolutional layers: $0.2$ to $0.3$ (Drops 20-30% of neurons)
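
A sketch of typical placement in a small classifier (PyTorch; the layer sizes and the assumed 32×32 input resolution are arbitrary choices for illustration):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.2),           # spatial dropout after a convolutional layer (lower rate)
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 128),  # assumes 32x32 input images
    nn.ReLU(),
    nn.Dropout(p=0.5),             # standard dropout after a fully connected layer
    nn.Linear(128, 10),            # output layer: never apply dropout here
)
```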

-Early Stopping

Early Stopping is a highly intuitive and widely used regularization technique. Unlike Dropout or L2 Regularization that modify the network's internal math or architecture, Early Stopping simply dictates when to stop the training process to prevent the model from overfitting.

When training a neural network, we continuously monitor two metrics over time (epochs): Training Loss and Validation Loss.

  1. Underfitting Phase (Early Training): At the beginning, the model is learning general, useful patterns. Both the Training Loss and the Validation Loss decrease together.
  2. The "Sweet Spot": At a certain epoch, the Validation Loss will reach its absolute minimum. This is the point where the model generalizes to unseen data the best.
  3. Overfitting Phase (Late Training): If training continues past the sweet spot, the model exhausts the general patterns and begins to memorize the specific noise and outliers of the training data.
    • The Training Loss continues to decrease (getting closer and closer to zero).
    • The Validation Loss stops decreasing and actually begins to increase.

The Goal: The fundamental idea of Early Stopping is to monitor the validation loss and halt the training process right at that "sweet spot" before the validation error starts rising.

Implementing Early Stopping

In reality, validation loss doesn't form a perfectly smooth "U" curve; it fluctuates up and down slightly from epoch to epoch due to mini-batch noise. If we stopped the very first time the validation loss went up by a tiny fraction, we might stop too early and miss out on further learning.

To solve this, we introduce a hyper-parameter called Patience.

Patience: How many epochs to wait (after the validation loss stops improving) before officially killing the training process.

The Early Stopping Algorithm:

  1. Track the lowest (best) validation loss seen so far.
  2. If validation loss improves (drops below the current best), save the model's current weights and reset the patience counter to 0.
  3. If there is no improvement, increment the patience counter. If the counter reaches the patience limit (e.g., no improvement for 5 consecutive epochs), stop training.
  4. Restore best weights: Discard the current weights and reload the saved weights from the epoch that had the best validation loss.

Key Parameters to Configure:

  • Monitor: What metric to track (usually val_loss / validation loss).
  • Patience: How many epochs to wait for an improvement (typically set between 5 to 10 epochs, depending on dataset size and learning rate).
  • Restore best weights: A crucial setting (True). If you don't restore the best weights, your final model will be the slightly overfitted version from the exact moment training stopped, rather than the optimal version from several epochs prior!
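
Below is a minimal sketch of the patience loop described above. It assumes a PyTorch-style `model` exposing `state_dict()` / `load_state_dict()`, while `train_one_epoch`, `evaluate`, and `val_data` are hypothetical placeholders for your own training and validation code:

```python
import copy

max_epochs = 100
patience = 5                    # epochs to wait without improvement before stopping
best_val_loss = float("inf")
best_weights = None
wait = 0

for epoch in range(max_epochs):
    train_one_epoch(model)                    # hypothetical training step
    val_loss = evaluate(model, val_data)      # hypothetical validation pass

    if val_loss < best_val_loss:              # improvement: save weights, reset counter
        best_val_loss = val_loss
        best_weights = copy.deepcopy(model.state_dict())
        wait = 0
    else:                                     # no improvement: increment the counter
        wait += 1
        if wait >= patience:                  # patience exhausted: stop training
            break

model.load_state_dict(best_weights)           # restore the best weights seen
```

In Keras the same behaviour is available out of the box via `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)`.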

-Batch Normalization

As a neural network trains, the weights of the earlier layers constantly change. This causes the distribution of the inputs to the later layers to shift constantly—a problem known as Internal Covariate Shift. Because the later layers have to continuously adapt to new input distributions, training becomes slow, unstable, and highly sensitive to weight initialization.

Batch Normalization (BN) solves this by normalizing the inputs of each layer so they have a stable mean and variance across every mini-batch. It is typically applied right after the linear transformation (matrix multiplication) but before the nonlinear activation function.

The Mathematical Process

For a given layer, let $\mathcal{B} = \{x_1, x_2, \dots, x_m\}$ be the input values to the Batch Normalization layer across a mini-batch of size $m$.

💡 Note on Notation: In standard BN formulas, $x$ and $y$ are used as generic inputs and outputs. Because BN is typically applied before the activation function, the BN input $x$ actually represents the pre-activation $z$ (where $z = W \cdot a_{prev}$ ). The BN output $y$ is then passed into the activation function to become $a$ (i.e., $a = \text{ReLU}(y)$ ).

Batch Normalization performs the following four steps:

1. Compute Mini-Batch Mean:

$$ \mu_{\mathcal{B}} = \dfrac{1}{m} \sum_{i=1}^{m} x_i $$

2. Compute Mini-Batch Variance:

$$ \sigma_{\mathcal{B}}^2 = \dfrac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 $$

3. Normalize:

$$ \hat{x}_i = \dfrac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} $$

(Note: $\epsilon$ is a tiny constant, e.g., $10^{-5}$, added to the variance to prevent division by zero).

4. Scale and Shift (Learnable Parameters):

$$ y_i = \gamma \hat{x}_i + \beta $$
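
A minimal NumPy sketch of the four steps for a single feature across one mini-batch (the input values, `gamma`, and `beta` are illustrative assumptions):

```python
import numpy as np

x = np.array([2.0, 3.5, 1.0, 4.5])     # pre-activations z for one feature across the batch
gamma, beta = 1.0, 0.0                  # learnable scale and shift (typical initial values)
eps = 1e-5

mu = x.mean()                           # 1. mini-batch mean
var = x.var()                           # 2. mini-batch variance (biased, 1/m)
x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
y = gamma * x_hat + beta                # 4. scale and shift
```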

Why do we need $\gamma$ and $\beta$?

If we only performed Step 3, the output of the layer would always be strictly forced to have a mean of 0 and a variance of 1. This could severely restrict the representational power of the network (for example, it might force the inputs to a Sigmoid function exclusively into its linear region, destroying the nonlinearity).

By introducing the learnable parameters Scale ($\gamma$) and Shift ($\beta$), we give the network the ability to stretch and shift the normalized values to whatever distribution is actually optimal for the network. If the network decides the original, un-normalized distribution was better, it can simply learn $\gamma = \sigma_{\mathcal{B}}$ and $\beta = \mu_{\mathcal{B}}$ to perfectly undo the normalization!

Training vs. Inference (Testing)

Batch Normalization behaves very differently during training compared to inference.

During Training:

  • The mean $\mu_{\mathcal{B}}$ and variance $\sigma_{\mathcal{B}}^2$ are calculated using the current mini-batch.
  • Behind the scenes, the layer keeps a running tally (an exponentially moving average) of all the batch means and variances it has seen so far.

During Inference (Testing):

  • You might only be predicting on a single sample, so calculating a "batch mean" doesn't make sense.
  • Instead, the network completely stops calculating batch statistics. It uses the running averages of the mean and variance that it saved during training to normalize the new data.
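
In PyTorch this switch is controlled by the module's training mode; a small sketch (the feature size and batch size are arbitrary assumptions):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)            # tracks running_mean / running_var while training
x = torch.randn(8, 4)             # mini-batch of 8 samples with 4 features (illustration)

bn.train()                        # training mode: normalize with batch statistics
y_train = bn(x)                   # and update the exponential moving averages

bn.eval()                         # inference mode: use the stored running averages
y_test = bn(torch.randn(1, 4))    # works even when predicting on a single sample
```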

Batch Normalization in Practice: Leaving out Biases

A standard neural network layer computes $z = Wx + b$. However, if a layer is immediately followed by Batch Normalization, the bias term ($b$) should be removed entirely (e.g., nn.Linear(..., bias=False) in PyTorch).

Why? Because Batch Normalization subtracts the mean in Step 3, any constant bias $b$ added in the previous step gets completely canceled out. The BN layer's own shift parameter ($\beta$) takes over the exact same role as the bias, making the original bias term a mathematically useless, redundant parameter that wastes memory.
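
A sketch of the recommended pattern (layer sizes are arbitrary assumptions):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(256, 128, bias=False),   # bias omitted: BN's beta takes over its role
    nn.BatchNorm1d(128),               # applied after the linear transform...
    nn.ReLU(),                         # ...and before the nonlinear activation
)
```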

Key Benefits of Batch Normalization

  1. Allows Higher Learning Rates: Because activations are prevented from blowing up or vanishing, the network is much more stable and can handle larger learning rates, drastically speeding up convergence.
  2. Mitigates Gradient Saturation: It keeps pre-activation values in the "sweet spot" (active regions) of activation functions like Sigmoid or Tanh, preventing gradients from becoming near-zero.
  3. Less Sensitive to Initialization: The network becomes much more robust to bad initial weight choices.
  4. Acts as a Mild Regularizer: Because each mini-batch has slightly different statistics, a small amount of "noise" is added to the activations during training. This creates a mild regularizing effect, sometimes reducing the need for heavy Dropout.