Gradient Descent (GD)
The foundational algorithm. It finds a minimum of the loss function by iteratively moving in the direction opposite to the gradient.
$w_{new} = w_{old} - \eta \nabla L(w_{old})$ |
First-order |
Learning Rate ($\eta$) |
Conceptually simple; converges to the global minimum for convex functions given a suitably chosen learning rate.
Computationally expensive for large datasets (requires full pass), can get stuck in local minima. |
Small datasets, understanding the core optimization concept. |
Generally attributed to Cauchy (1847) for solving astronomical problems. |
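A minimal NumPy sketch of the full-batch update above on a toy least-squares problem; the data shapes, learning rate, and step count are illustrative assumptions, not from the source.

```python
import numpy as np

# Toy least-squares problem: L(w) = ||Xw - y||^2 / (2n); all values are illustrative.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
eta = 0.1  # learning rate

for step in range(500):
    grad = X.T @ (X @ w - y) / len(y)  # gradient over the full dataset (one full pass)
    w = w - eta * grad                 # w_new = w_old - eta * grad
```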
Stochastic Gradient Descent (SGD) |
Developed to address the computational cost of GD on large datasets by updating parameters based on a single training example at a time. |
$w_{new} = w_{old} - \eta \nabla L(w_{old}; x^{(i)}, y^{(i)})$ |
First-order, Stochastic |
Learning Rate ($\eta$) |
Much faster than GD on large datasets, can escape shallow local minima due to noise. |
Noisy updates can cause oscillations, can be slower to converge precisely, sensitive to learning rate. |
Large datasets, online learning. |
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400-407. |
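A sketch of the single-example update under the same assumed toy setup; sampling uniformly with replacement is one common choice.

```python
import numpy as np

# Same toy least-squares setup; each update sees a single randomly drawn example.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
eta = 0.01

for step in range(5000):
    i = rng.integers(len(y))         # draw one training example
    grad = X[i] * (X[i] @ w - y[i])  # gradient on (x_i, y_i) only
    w -= eta * grad
```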
Mini-Batch Gradient Descent (MB-SGD) |
A compromise between GD and SGD that balances computational efficiency against update noise: each update is computed on a small batch of examples.
$w_{new} = w_{old} - \eta \nabla L(w_{old}; \text{batch})$ |
First-order |
Learning Rate ($\eta$), Batch Size |
More stable updates than SGD, more computationally efficient than GD, widely used in practice. |
Requires tuning of batch size, can still oscillate. |
Most deep learning tasks, standard practice. |
N/A (A widely adopted practical variation, not tied to a single foundational paper) |
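A sketch of the mini-batch variant on the same assumed toy data; the batch size of 16 and the shuffle-per-epoch scheme are illustrative choices.

```python
import numpy as np

# Each update averages the gradient over a small shuffled batch.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, eta, batch_size = np.zeros(3), 0.05, 16

for epoch in range(100):
    perm = rng.permutation(len(y))               # reshuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= eta * grad
```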
SGD with Momentum |
Introduced to accelerate SGD by adding a fraction of the previous update vector to the current update, which builds velocity along consistent gradient directions and damps oscillations.
$v_{new} = \gamma v_{old} + \eta \nabla L(w_{old})$; $w_{new} = w_{old} - v_{new}$
First-order, Momentum |
Learning Rate ($\eta$), Momentum ($\gamma$) |
Accelerates convergence, dampens oscillations, helps navigate flat regions and escape shallow local minima. |
Requires tuning of momentum coefficient, can overshoot the minimum. |
When faster convergence is needed, overcoming oscillations in SGD/MB-SGD. |
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, 318-362.
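The two-line momentum update as a small helper; the function name and default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, v, eta=0.01, gamma=0.9):
    """One heavy-ball momentum update; v is a decaying sum of past gradients."""
    v = gamma * v + eta * grad  # v_new = gamma * v_old + eta * grad
    return w - v, v             # w_new = w_old - v_new

# usage: w, v = momentum_step(w, grad_fn(w), v), starting from v = np.zeros_like(w)
```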
Nesterov Accelerated Gradient (NAG) |
An improvement over standard momentum that calculates the gradient at a "lookahead" point in the direction of the momentum. |
$v_{new} = \gamma v_{old} + \eta \nabla L(w_{old} - \gamma v_{old})$; $w_{new} = w_{old} - v_{new}$
First-order, Momentum |
Learning Rate ($\eta$), Momentum ($\gamma$) |
Often converges faster than standard momentum, provides a more direct path to the minimum. |
Still requires tuning of hyperparameters. |
Similar to SGD with Momentum, can offer slight performance gains. |
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Doklady Akademii Nauk SSSR, 269, 543-547. |
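A sketch of the Nesterov step; because the gradient is evaluated at the lookahead point, the helper takes a gradient function rather than a precomputed gradient. Names and defaults are illustrative.

```python
import numpy as np

def nag_step(w, grad_fn, v, eta=0.01, gamma=0.9):
    """One Nesterov step: the gradient is taken at the lookahead point w - gamma * v."""
    v = gamma * v + eta * grad_fn(w - gamma * v)
    return w - v, v

# usage: w, v = nag_step(w, grad_fn, v), where grad_fn(w) returns the loss gradient at w
```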
AdaGrad (Adaptive Gradient) |
Developed to adapt the learning rate for each parameter individually, scaling it inversely to the square root of the accumulated squared gradients. Useful for sparse data.
$G_{new} = G_{old} + (\nabla L(w_{old}))^2$ (element-wise); $w_{new} = w_{old} - \frac{\eta}{\sqrt{G_{new} + \epsilon}} \nabla L(w_{old})$
Adaptive Learning Rate |
Learning Rate ($\eta$), Epsilon ($\epsilon$) |
Adapts learning rates per-parameter, effective for sparse data where gradients vary greatly. |
Learning rate can become vanishingly small over time, slowing down convergence significantly. |
Sparse data, natural language processing tasks. |
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159. |
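A compact AdaGrad step; G starts at zeros and accumulates squared gradients per parameter. Defaults are illustrative.

```python
import numpy as np

def adagrad_step(w, grad, G, eta=0.1, eps=1e-8):
    """One AdaGrad update: each parameter is scaled by its own gradient history."""
    G = G + grad ** 2                            # cumulative squared gradients
    return w - eta / np.sqrt(G + eps) * grad, G
```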
RMSProp (Root Mean Square Propagation) |
Addressed AdaGrad's rapidly decreasing learning rate by using an exponentially decaying average of squared gradients instead of the cumulative sum. |
$E[g^2]_{new} = \gamma E[g^2]_{old} + (1 - \gamma) (\nabla L(w_{old}))^2$; $w_{new} = w_{old} - \frac{\eta}{\sqrt{E[g^2]_{new} + \epsilon}} \nabla L(w_{old})$
Adaptive Learning Rate |
Learning Rate ($\eta$), Decay Rate ($\gamma$), Epsilon ($\epsilon$) |
Prevents learning rate from decaying too quickly, generally works well in practice. |
Can still struggle with converging to the optimal solution in some cases. |
Non-stationary objectives, frequently used in deep learning. |
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. |
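A compact RMSProp step under the same conventions; the decaying average replaces AdaGrad's cumulative sum.

```python
import numpy as np

def rmsprop_step(w, grad, Eg2, eta=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: Eg2 is an exponentially decaying average of squared gradients."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    return w - eta / np.sqrt(Eg2 + eps) * grad, Eg2
```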
AdaDelta |
An extension of AdaGrad that removes the need for a manually set learning rate: the step is scaled by the ratio of two exponentially decaying averages, one of past parameter updates and one of past squared gradients.
$\Delta w = - \frac{\sqrt{E[\Delta w^2]_{old} + \epsilon}}{\sqrt{E[g^2]_{new} + \epsilon}} \nabla L(w_{old})$; $E[\Delta w^2]_{new} = \gamma E[\Delta w^2]_{old} + (1 - \gamma) (\Delta w)^2$; $w_{new} = w_{old} + \Delta w$
Adaptive Learning Rate |
Decay Rate ($\gamma$), Epsilon ($\epsilon$) |
Does not require setting a learning rate, robust to different scales of gradients. |
Can be less intuitive than optimizers with a learning rate. |
Similar to RMSProp, when learning rate tuning is difficult. |
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701. |
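A compact AdaDelta step; note that no learning rate appears, only the decay rate and epsilon.

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edw2, gamma=0.95, eps=1e-6):
    """One AdaDelta update: the step scale comes from the ratio of running averages."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * grad  # unit-consistent step
    Edw2 = gamma * Edw2 + (1 - gamma) * dw ** 2
    return w + dw, Eg2, Edw2
```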
Adam (Adaptive Moment Estimation) |
Combines the concepts of momentum and RMSProp, using exponentially decaying averages of both the first and second moments of gradients with bias correction. |
$m_{new} = \beta_1 m_{old} + (1 - \beta_1) \nabla L(w_{old})$; $v_{new} = \beta_2 v_{old} + (1 - \beta_2) (\nabla L(w_{old}))^2$; $\hat{m}_{new} = \frac{m_{new}}{1 - \beta_1^{t}}$; $\hat{v}_{new} = \frac{v_{new}}{1 - \beta_2^{t}}$; $w_{new} = w_{old} - \frac{\eta}{\sqrt{\hat{v}_{new}} + \epsilon} \hat{m}_{new}$
Adaptive Learning Rate, Momentum |
Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) |
Generally performs well and is robust to hyperparameter choices, fast convergence, handles sparse gradients. |
Can sometimes converge to solutions that generalize worse than those found by well-tuned SGD.
Most deep learning tasks, a good default choice. |
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. |
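A compact Adam step following the update above; t is the 1-based step counter used for bias correction, and the defaults mirror commonly used values.

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```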
Adamax |
A variant of Adam that uses the infinity norm for the second moment, which can be more stable. |
Similar to Adam, but $v_{new} = \max(\beta_2 v_{old}, |\nabla L(w_{old})|)$ (element-wise); $w_{new} = w_{old} - \frac{\eta}{v_{new} + \epsilon} \hat{m}_{new}$
Adaptive Learning Rate, Momentum |
Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) |
More stable with large parameter values compared to Adam. |
Less commonly used than Adam, performance gains may be marginal. |
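When Adam's updates are destabilized by occasionally large gradients; otherwise used in the same settings as Adam.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. (Adamax is introduced alongside Adam in the same paper.)

A compact Adamax step; the only change from Adam is the infinity-norm second-moment estimate, and the epsilon term follows the formulation above rather than the original paper. Names and defaults are illustrative.

```python
import numpy as np

def adamax_step(w, grad, m, u, t, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update; u is the infinity-norm analogue of Adam's second moment."""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))  # element-wise max; no bias correction needed for u
    return w - eta * (m / (1 - beta1 ** t)) / (u + eps), m, u
```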
Nadam (Nesterov + Adam) |
Combines Adam with the Nesterov accelerated gradient, incorporating the "lookahead" property. |
Incorporates Nesterov momentum into the Adam update rules. |
Adaptive Learning Rate, Momentum |
Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) |
Often provides slightly faster convergence and potentially better performance than Adam. |
Slightly more complex formulation than Adam. |
Similar to Adam, can be tried for potential improvements. |
Dozat, T. (2016). Incorporating Nesterov momentum into Adam. ICLR 2016 Workshop Track.
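A compact sketch of Nadam using the commonly implemented simplification (a constant $\beta_1$ rather than Dozat's time-varying momentum schedule); treat the exact form as an assumption.

```python
import numpy as np

def nadam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update: an Adam step with a Nesterov-style lookahead on the momentum term."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    m_nesterov = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)  # lookahead combination
    return w - eta * m_nesterov / (np.sqrt(v_hat) + eps), m, v
```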
AMSGrad |
A modification of Adam that ensures the second moment estimate is non-decreasing to address potential convergence issues. |
Uses $\hat{v}_{new} = \max(\hat{v}_{old}, v_{new})$ in the Adam update.
Adaptive Learning Rate, Momentum |
Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) |
Provides stronger theoretical convergence guarantees, can sometimes outperform Adam. |
Empirical performance gains over Adam are not always significant and can be problem-dependent. |
When stable convergence is critical, or Adam shows unstable behavior. |
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237. |
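A compact AMSGrad step; the denominator uses the running maximum of the second-moment estimate, and Adam's bias correction is omitted here as in the original algorithm.

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: the effective per-parameter step size can never increase."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)  # non-decreasing second-moment estimate
    return w - eta * m / (np.sqrt(v_max) + eps), m, v, v_max
```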
Lookahead |
A wrapper that improves the convergence and generalization of inner optimizers by using "slow" and "fast" weights. |
Periodically updates "slow" weights by interpolating towards "fast" weights optimized by an inner optimizer (e.g., Adam or SGD). |
Wrapper, Meta-optimizer |
Lookahead steps ($k$), Interpolation factor ($\alpha$)
Can lead to faster convergence and potentially better generalization, can be applied with existing optimizers. |
Adds complexity, requires tuning of additional hyperparameters. |
Improving the performance of base optimizers. |
Zhang, M. R., Lucas, J., Hinton, G., & Ba, J. (2019). Lookahead optimizer: k steps forward, 1 step back. Advances in Neural Information Processing Systems, 32.
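A sketch of Lookahead wrapped around plain SGD as the inner optimizer; the toy data, k = 5, and alpha = 0.5 are illustrative, and any inner optimizer could replace the SGD line.

```python
import numpy as np

# Toy least-squares setup; 'fast' weights take k inner SGD steps, then the
# 'slow' weights interpolate toward them and the fast weights are reset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
slow = np.zeros(3)
fast = slow.copy()
eta, k, alpha = 0.05, 5, 0.5

for step in range(1, 1001):
    grad = X.T @ (X @ fast - y) / len(y)
    fast -= eta * grad                       # inner ("fast") optimizer step: SGD here
    if step % k == 0:
        slow = slow + alpha * (fast - slow)  # slow weights: interpolate toward fast weights
        fast = slow.copy()                   # restart fast weights from the slow weights
```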
RAdam (Rectified Adam) |
Addresses Adam's instability in the earliest training steps, which stems from the high variance of the adaptive learning rate when only a few gradients have been observed.
Modifies the Adam update rule based on the variance of the adaptive learning rate, especially in early timesteps. |
Adaptive Learning Rate, Momentum |
Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) |
More stable and reliable training in the early stages, less sensitive to initial learning rate choice. |
Can be slightly more complex to implement than standard Adam. |
Training deep networks from scratch, when Adam struggles initially. |
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2020). On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.
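A sketch of the rectified update, assuming the variance-rectification term and threshold from the RAdam paper; the toy data and hyperparameters are illustrative.

```python
import numpy as np

# Toy least-squares setup; early steps (small rho_t) fall back to plain momentum.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
rho_inf = 2.0 / (1.0 - beta2) - 1.0

for t in range(1, 1001):
    grad = X.T @ (X @ w - y) / len(y)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
    if rho_t > 4:  # variance of the adaptive step is tractable: rectified Adam-style update
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                      ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        w -= eta * r_t * m_hat / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    else:          # early steps: un-adapted momentum update
        w -= eta * m_hat
```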
Fromage |
A more recent technique motivated by bounding the functional distance between the current and updated network, aiming for robustness to the choice of learning rate.
Scales each layer's update relative to the norm of that layer's weights (a relative, per-layer step), guided by a bound on the distance between the functions computed before and after the update.
Functional Analysis-based |
Learning Rate ($\eta$), various internal parameters |
Aims for improved robustness and navigation of complex loss landscapes, theoretically grounded in functional analysis. |
Newer technique, less widely adopted and tested compared to established optimizers, can be computationally intensive. |
Research and experimentation, potentially for challenging optimization problems. |
Bernstein, J., Vahdat, A., Yue, Y., & Liu, M.-Y. (2020). On the distance between two neural networks and the stability of learning. Advances in Neural Information Processing Systems, 33.
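A heavily hedged sketch, assuming Fromage is the layer-wise relative update from Bernstein et al. (2020): each step is scaled by the ratio of weight norm to gradient norm and renormalized by $1/\sqrt{1+\eta^2}$ to keep weight norms from drifting. The single-vector toy problem stands in for a per-layer application.

```python
import numpy as np

# Toy problem with a non-zero initialization (the step is relative to ||w||).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = 0.1 * rng.normal(size=3)
eta = 0.01

for step in range(1000):
    grad = X.T @ (X @ w - y) / len(y)
    g_norm, w_norm = np.linalg.norm(grad), np.linalg.norm(w)
    if g_norm > 0:
        # relative step scaled by ||w|| / ||grad||, then renormalized by 1/sqrt(1 + eta^2)
        w = (w - eta * (w_norm / g_norm) * grad) / np.sqrt(1 + eta ** 2)
```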