05 Gradient Descent Optimization
Full-Batch Gradient Descent
Full-Batch Gradient Descent uses the entire dataset for each gradient update within a single epoch (one complete pass over all training data points). It is accurate but slow, because the gradients of all data points must be computed before the weights can be updated.
$$ w \leftarrow w-\alpha \dfrac{1}{n} \sum_{i=1}^n \nabla L_i $$
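A minimal NumPy sketch of this update, assuming an illustrative least-squares loss (the data `X`, `y`, the learning rate, and the epoch count are made-up values, not from this wiki):

```python
import numpy as np

# Full-batch gradient descent on an illustrative least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 data points, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # made-up "true" weights
w = np.zeros(3)
alpha = 0.1

for epoch in range(50):
    # Gradient of the mean squared error over the *entire* dataset:
    grad = X.T @ (X @ w - y) / len(X)  # (1/n) * sum_i grad L_i
    w -= alpha * grad                  # one weight update per epoch
```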
Stochastic Gradient Descent (one random sample)
In a single epoch, Stochastic Gradient Descent picks one random sample from the dataset, computes its gradient, and updates the weights; it then picks another sample (without replacement) and updates the weights again, repeating until every data point has been used once. It is fast, since each weight update only requires the gradient of a single data point, but an individual step may go in the wrong direction.
$$ w \leftarrow w - \alpha \nabla L_i~~~~~\text{(one random sample)} $$
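A minimal NumPy sketch on the same illustrative least-squares loss, performing one update per randomly chosen sample within each epoch (all names and settings are assumptions):

```python
import numpy as np

# Stochastic gradient descent: one random sample per update,
# drawn without replacement within each epoch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
alpha = 0.01

for epoch in range(10):
    for i in rng.permutation(len(X)):      # shuffle, then visit every point once
        grad_i = X[i] * (X[i] @ w - y[i])  # gradient of L_i for a single sample
        w -= alpha * grad_i                # update after every sample
```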
Mini-batch Gradient Descent
Mini-batch Gradient Descent is an intermediate between Full-Batch Gradient Descent and Stochastic Gradient Descent (one random sample). In a single epoch, $B$ samples (typically between 32 and 256) are drawn from the dataset to compute the gradient and update the weights; then another $B$ samples are drawn (without replacement) and the weights are updated again, repeating until every data point has been used once. The mini-batch gradient points in the general direction of the minimum with some noise (it is not the exact gradient of the full dataset); however, this noise may help the descent escape local minima.
$$ w \leftarrow w - \alpha \dfrac{1}{B} \sum_{i \in batch} \nabla L_i $$
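A minimal NumPy sketch of the mini-batch update, again on an illustrative least-squares loss (the dataset, $B$, the learning rate, and the epoch count are assumptions):

```python
import numpy as np

# Mini-batch gradient descent with batch size B.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
alpha, B = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(X))             # draw batches without replacement
    for start in range(0, len(X), B):
        idx = order[start:start + B]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)  # (1/B) * sum over the batch
        w -= alpha * grad
```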
Momentum
In Gradient Descent with Momentum, instead of stepping proportionally to the current gradient, we accumulate past gradients into a velocity $v$ using the momentum coefficient $\beta$ (typically $0.9$). Momentum makes the descent less likely to oscillate in regions of shallow gradient, because there is a higher tendency to keep moving in the direction of the previous step.
$$ v_t=\beta v_{t-1} +\nabla L(w_t) $$
$$ w_{t+1}=w_t - \alpha v_t $$
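A minimal NumPy sketch of the two update rules above, on the same illustrative least-squares loss (data and settings are assumptions):

```python
import numpy as np

# Gradient descent with momentum: v <- beta*v + grad, w <- w - alpha*v.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
v = np.zeros(3)                        # velocity, initialised at zero
alpha, beta = 0.05, 0.9

for epoch in range(50):
    grad = X.T @ (X @ w - y) / len(X)  # full-batch gradient of the MSE loss
    v = beta * v + grad                # accumulate velocity
    w -= alpha * v                     # step along the velocity, not the raw gradient
```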
Adam: Adaptive Moment Estimation
Adam computes adaptive learning rates for each parameter. It tracks the moving averages of both the gradient and its square to adjust the step size dynamically.
Step 1: Accumulate Moments
At each timestep $t$, we calculate the gradient $g_t$ (which is $\nabla L(w_t)$ ) and update the two moments:
Note: $g_t$, $m_t$, and $v_t$ are vectors with the same shape as the weights. All operations below (multiplication, squaring) are performed element-wise, meaning every single weight has its own specific $m_t$ and $v_t$ value.
- First Moment ($m_t$ - The Momentum): Tracks the average direction.
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t$$
- Second Moment ($v_t$ - The Scaling): Tracks the average magnitude (uncentered variance).
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$$
Step 2: Bias Correction
Since $m_t$ and $v_t$ are initialized at $0$, they are biased toward zero during the early stages of training. We correct them using:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad \text{and} \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
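As a quick check of why this matters, consider the very first step ($t = 1$) with $\beta_1 = 0.9$ and $m_0 = 0$:

$$m_1 = 0.9 \cdot 0 + 0.1\, g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{0.1\, g_1}{1 - 0.9^1} = g_1$$

so the corrected estimate already equals the true first gradient instead of a value shrunk toward zero. The same argument applies to $\hat{v}_1$ with $\beta_2 = 0.999$.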
Step 3: Parameter Update
The weights are updated by combining the "direction" ( $\hat{m}_t$ ) and the "scale" ( $\sqrt{\hat{v}_t} $):
$$w_{t+1} = w_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
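Putting Steps 1-3 together, here is a minimal NumPy sketch of the Adam update applied to an illustrative least-squares problem (the dataset, loss, and step count are assumptions; the hyperparameters follow the defaults in the table below):

```python
import numpy as np

# Adam on an illustrative least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = np.zeros_like(w)   # first moment, same shape as the weights
v = np.zeros_like(w)   # second moment, same shape as the weights

for t in range(1, 1001):                         # t starts at 1 for bias correction
    g = X.T @ (X @ w - y) / len(X)               # gradient g_t
    m = beta1 * m + (1 - beta1) * g              # Step 1: first moment (element-wise)
    v = beta2 * v + (1 - beta2) * g**2           # Step 1: second moment (element-wise)
    m_hat = m / (1 - beta1**t)                   # Step 2: bias correction
    v_hat = v / (1 - beta2**t)
    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # Step 3: per-parameter step
```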
Key Hyperparameters (Standard Defaults)
| Parameter | Recommended Value | Description |
|---|---|---|
| $\alpha$ | 0.001 | Learning Rate: The global step size. |
| $\beta_1$ | 0.9 | Momentum decay: How much we trust past directions. |
| $\beta_2$ | 0.999 | Scaling decay: How much we trust past gradient magnitudes. |
| $\epsilon$ | 1E-8 | Epsilon: A tiny value to prevent division by zero. |
Why use Adam over standard Momentum?
- Adaptive per-parameter: While the learning rate $\alpha$ is fixed globally, Adam divides it by $\sqrt{\hat{v}_t}$ separately for each weight. This means parameters with large gradients get smaller effective steps (for stability), and parameters with small gradients get larger steps (for speed).
- Handles Sparse Gradients: Because it divides by the second moment, weights that rarely get updated receive a "boost" when they finally do.
- Automatic "Braking": If gradients become very volatile (large variance), the denominator $\sqrt{\hat{v}_t}$ increases, which automatically shrinks the step size to keep training stable.
Note: Some research has shown that SGD+Momentum sometimes finds better minima (especially in image classification). In practice, it is suggested to try Adam first and, if there are signs of overfitting, to try SGD+Momentum with tuning.
Learning Rate and Batch Size
Learning Rate
- Too Small: Loss decreases very slowly; may get stuck in local minima or saddle points.
- Too Large: Loss oscillates (increases and decreases) widely; may fail to converge or even diverge (go to infinity).
Batch Size Trade-off
- Small (8-32):
- Noisy Gradient: The estimate has high variance. This noise acts as a regularizer, often helping the model generalize better.
- Memory: Requires less GPU memory.
- Updates: More updates per epoch (weights change frequently).
- Large (128-512+):
- Accurate Gradient: The estimate is smoother/more stable.
- Memory: Requires more GPU memory.
- Updates: Fewer updates per epoch (weights change less frequently).
- Learning Rate: Can use a larger learning rate (less likely to oscillate since the gradient estimate is more accurate).
Heuristics (Rules of Thumb)
Linear Scaling Rule: When training with large batches, a common practice (for SGD) is to scale the learning rate linearly. If you multiply the batch size by $k$, multiply the learning rate by $k$.
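For example (illustrative numbers): moving from a batch size of 32 to 256 multiplies the batch by $k = 8$, so a base learning rate of 0.01 would be scaled to 0.08 under this rule.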
Troubleshooting:
- If Loss does not decrease: Try a larger learning rate (you might be stuck or moving too slowly).
- If Loss oscillates: Try a smaller learning rate (you are overshooting the minimum).
- If training is computationally slow (time): Increase the batch size to better utilize GPU parallelism.