[25.05.31] Adam: A Method for Stochastic Optimization
Paper Reading Study Notes
General Information
Paper Title: Adam: A Method for Stochastic Optimization
Authors: Diederik P. Kingma, Jimmy Lei Ba
Published In: ICLR 2015 (conference paper)
Year: 2015 (arXiv preprint v1: Dec 2014, v9: Jan 2017)
Link: arXiv:1412.6980
Date of Discussion: 2025.05.31
Summary
Research Problem: The paper addresses the need for an efficient stochastic optimization algorithm suitable for problems with large datasets and/or high-dimensional parameter spaces, particularly in the context of deep learning. It aims to handle noisy and/or sparse gradients and non-stationary objectives effectively.
Key Contributions:
Introduction of Adam, an algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.
Combines the advantages of AdaGrad (handles sparse gradients well) and RMSProp (works well in online and non-stationary settings).
Introduces a bias-correction step for the moment estimates to counteract their initialization at zero, which is particularly important during initial timesteps and with high decay rates (betas close to 1).
Methodology/Approach:
Adam maintains an exponentially decaying average of past gradients (1st moment, m_t) and past squared gradients (2nd moment, v_t). These moments are bias-corrected. The parameter update is then performed by scaling the (bias-corrected) first moment by the inverse of the square root of the (bias-corrected) second moment, similar to RMSProp/AdaGrad, with an overall step size alpha.
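A minimal sketch of a single update step, following Algorithm 1 of the paper; the function name, signature, and NumPy usage here are illustrative choices rather than the paper's own pseudocode:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for timestep t (1-based); m and v start as zero arrays."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The defaults shown (alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the settings suggested in the paper.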
Results:
Empirical results on logistic regression, multilayer neural networks, and convolutional neural networks demonstrate that Adam works well in practice, often outperforming other stochastic optimization methods like SGD with Nesterov momentum, AdaGrad, and RMSProp. The bias-correction term is shown to be beneficial, especially with beta2 values close to 1.
Discussion Points
Strengths:
The algorithm is conceptually elegant and relatively simple to implement, especially if one understands its predecessors (AdaGrad, Momentum). (04:08, 04:15)
The bias correction mechanism is a crucial and well-justified component, particularly for early training stability. (05:07, 06:30, 15:23)
The effective step size alpha provides an approximate upper bound on the magnitude of parameter updates, which can be interpreted as a "trust region" (the bound is restated after this list). (07:24 - 07:58)
The paper is well-written, clear, and "without unnecessary parts." (00:55, 41:07)
Adam's design effectively handles sparse gradients and non-stationary objectives. (Referenced from paper, discussed implicitly)
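For reference, the bound from Section 2.1 of the paper, where the effective step is $\Delta_t = \alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t}$ (ignoring epsilon):

$$|\Delta_t| \le \alpha \cdot \frac{1-\beta_1}{\sqrt{1-\beta_2}} \quad \text{if } (1-\beta_1) > \sqrt{1-\beta_2}, \qquad |\Delta_t| \le \alpha \text{ otherwise.}$$

In common scenarios $\hat{m}_t/\sqrt{\hat{v}_t} \approx \pm 1$, so $|\Delta_t|$ stays roughly below alpha, which is the "trust region" reading above.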
Key Questions (raised during discussion):
The practical interpretation of setting alpha based on an expected distance to the optimum. (09:48, 10:13)
The precise, distinct roles and sensitivities of beta1 vs. beta2 in the update rule and bias correction (a toy comparison is sketched after this list). (16:13, 21:14)
Why beta2 = 0 (no second-moment accumulation) was not explicitly tested or discussed in the bias-correction experiments (Figure 4); the group reasoned that this case would behave similarly to SGD and was therefore not the focus. (33:28, 36:56)
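A toy comparison (an illustrative sketch, not an experiment from the paper): beta1 only smooths the numerator (the update direction), while beta2 only smooths the denominator (the per-parameter scale). Tracking the bias-corrected ratio on a noisy gradient stream makes the two roles visible:

```python
import numpy as np

def adam_ratio(grads, beta1, beta2, eps=1e-8):
    """Bias-corrected update ratio m_hat / sqrt(v_hat) after the last step."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    return (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)

grads = np.random.default_rng(0).normal(1.0, 2.0, size=200)  # noisy gradients, mean 1
print(adam_ratio(grads, beta1=0.9, beta2=0.999))  # smoothed direction, adaptive scale
print(adam_ratio(grads, beta1=0.0, beta2=0.999))  # raw last gradient in the numerator
print(adam_ratio(grads, beta1=0.9, beta2=0.0))    # scale taken from the last gradient only
```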
Applications:
Adam has become a de-facto standard optimizer for training deep neural networks across various domains (image classification, NLP, etc.).
Connections:
Directly builds upon and combines ideas from AdaGrad (adaptive learning rates, good for sparse gradients) and RMSProp (handles non-stationary objectives, uses squared gradients). (03:45)
Incorporates momentum (via the first moment estimate m_t). (00:09)
Compared against SGD with Nesterov momentum.
Notes and Reflections
Interesting Insights:
The "design philosophy" behind Adam, particularly the bias correction, was a key learning point. (00:09)
The step size alpha acting as an approximate bound for updates (|Δ_t| <= alpha in many common scenarios) and the "trust region" interpretation. (07:58, 08:28)
The signal-to-noise ratio (SNR) interpretation (m_t / sqrt(v_t)) and how it leads to smaller steps as optima are approached (automatic annealing). (13:54)
Bias correction is critical in early steps to prevent the moment estimates (initialized at 0) from being too small, especially when the beta values are close to 1 (slow decay of past information); a worked example follows after this list. (15:23, 24:58)
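A worked example of the correction, using the relation derived in Section 3 of the paper (assuming roughly stationary gradient statistics, with a small residual term zeta):

$$\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\,(1-\beta_2^t) + \zeta \;\;\Rightarrow\;\; \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

With beta2 = 0.999 and t = 10, 1 - beta2^t ≈ 0.01, so the uncorrected v_t underestimates E[g_t^2] by roughly a factor of 100; the denominator sqrt(v_t) would then be about 10x too small and the first updates about 10x too large, which is exactly what the correction prevents.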
Lessons Learned:
Understanding the motivation and impact of each component (like bias correction) is vital, not just the final algorithm.
Even widely used and seemingly "solved" methods like Adam have nuances that are worth exploring deeply.
Visualizing or stepping through the math with concrete examples (e.g., effect of large/small beta on bias-corrected terms) aids understanding. (21:14 - 25:15)
Future Directions (as per paper/discussion):
The paper itself introduces AdaMax as an extension (its update is restated after this list). (39:29)
The discussion acknowledged that Adam is a foundational method, and subsequent work (like AdamW) has built upon it. (28:44)
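For reference, the AdaMax update as given in Algorithm 2 of the paper, where the L2-norm-based v_t is replaced by an exponentially weighted infinity norm u_t (which needs no bias correction):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad u_t = \max(\beta_2 \cdot u_{t-1},\ |g_t|), \qquad \theta_t = \theta_{t-1} - \frac{\alpha}{1-\beta_1^t}\,\frac{m_t}{u_t}$$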