Neural Network Cheat Sheet

Neural Network Optimization Techniques

| Optimization Technique | Description and Why it was Applied | Mathematical Formulation | Type | Key Hyperparameters | Advantages | Disadvantages | Typical Use Cases / When to Use | Key Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gradient Descent (GD) | The foundational algorithm. Applied to find the minimum of a loss function by iteratively moving in the opposite direction of the gradient. | $w_{new} = w_{old} - \eta \nabla L(w_{old})$ | First-order | Learning Rate ($\eta$) | Conceptually simple, guaranteed convergence for convex functions. | Computationally expensive for large datasets (requires a full pass over the data per update), can get stuck in local minima. | Small datasets, understanding the core optimization concept. | Generally attributed to Cauchy (1847) for solving astronomical problems. |
| Stochastic Gradient Descent (SGD) | Developed to address the computational cost of GD on large datasets by updating parameters based on a single training example at a time. | $w_{new} = w_{old} - \eta \nabla L(w_{old}; x^{(i)}, y^{(i)})$ | First-order, Stochastic | Learning Rate ($\eta$) | Much faster than GD on large datasets, can escape shallow local minima due to noise. | Noisy updates can cause oscillations, can be slower to converge precisely, sensitive to learning rate. | Large datasets, online learning. | Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400-407. |
| Mini-Batch Gradient Descent (MB-SGD) | A compromise between GD and SGD to balance computational efficiency and the noise in updates. Updates are based on a small batch of examples. | $w_{new} = w_{old} - \eta \nabla L(w_{old}; \text{batch})$ | First-order | Learning Rate ($\eta$), Batch Size | More stable updates than SGD, more computationally efficient than GD, widely used in practice. | Requires tuning of the batch size, can still oscillate. | Most deep learning tasks, standard practice. | N/A (a widely adopted practical variation, not tied to a single foundational paper). |
| SGD with Momentum | Introduced to accelerate SGD by adding a fraction of the previous update vector to the current update. Builds up velocity along consistent gradient directions and dampens oscillations. | $v_{new} = \gamma v_{old} + \eta \nabla L(w_{old})$, $w_{new} = w_{old} - v_{new}$ | First-order, Momentum | Learning Rate ($\eta$), Momentum ($\gamma$) | Accelerates convergence, dampens oscillations, helps navigate flat regions and escape shallow local minima. | Requires tuning of the momentum coefficient, can overshoot the minimum. | When faster convergence is needed, or to overcome oscillations in SGD/MB-SGD. | Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, 318-362. |
| Nesterov Accelerated Gradient (NAG) | An improvement over standard momentum that calculates the gradient at a "lookahead" point in the direction of the momentum. | $v_{new} = \gamma v_{old} + \eta \nabla L(w_{old} - \gamma v_{old})$, $w_{new} = w_{old} - v_{new}$ | First-order, Momentum | Learning Rate ($\eta$), Momentum ($\gamma$) | Often converges faster than standard momentum, provides a more direct path to the minimum. | Still requires tuning of hyperparameters. | Similar to SGD with Momentum, can offer slight performance gains. | Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Doklady Akademii Nauk SSSR, 269, 543-547. |
| AdaGrad (Adaptive Gradient) | Developed to adapt the learning rate for each parameter individually, scaling it inversely to the history of squared gradients. Useful for sparse data. | $G_{new} = G_{old} + (\nabla L(w_{old}))^2$ (element-wise), $w_{new} = w_{old} - \frac{\eta}{\sqrt{G_{new} + \epsilon}} \nabla L(w_{old})$ | Adaptive Learning Rate | Learning Rate ($\eta$), Epsilon ($\epsilon$) | Adapts learning rates per parameter, effective for sparse data where gradients vary greatly. | Learning rate can become vanishingly small over time, slowing down convergence significantly. | Sparse data, natural language processing tasks. | Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159. |
| RMSProp (Root Mean Square Propagation) | Addressed AdaGrad's rapidly decreasing learning rate by using an exponentially decaying average of squared gradients instead of the cumulative sum. | $E[g^2]_{new} = \gamma E[g^2]_{old} + (1 - \gamma) (\nabla L(w_{old}))^2$, $w_{new} = w_{old} - \frac{\eta}{\sqrt{E[g^2]_{new} + \epsilon}} \nabla L(w_{old})$ | Adaptive Learning Rate | Learning Rate ($\eta$), Decay Rate ($\gamma$), Epsilon ($\epsilon$) | Prevents the learning rate from decaying too quickly, generally works well in practice. | Can still struggle to converge to the optimal solution in some cases. | Non-stationary objectives, frequently used in deep learning. | Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. |
| AdaDelta | An extension of AdaGrad that removes the need for a manually set learning rate by using the ratio of exponentially decaying averages of past squared updates and past squared gradients. | $\Delta w = - \frac{\sqrt{E[\Delta w^2]_{old} + \epsilon}}{\sqrt{E[g^2]_{new} + \epsilon}} \nabla L(w_{old})$, $E[\Delta w^2]_{new} = \gamma E[\Delta w^2]_{old} + (1 - \gamma) (\Delta w)^2$, $w_{new} = w_{old} + \Delta w$ | Adaptive Learning Rate | Decay Rate ($\gamma$), Epsilon ($\epsilon$) | Does not require setting a learning rate, robust to different scales of gradients. | Can be less intuitive than optimizers with a learning rate. | Similar to RMSProp, when learning rate tuning is difficult. | Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701. |
| Adam (Adaptive Moment Estimation) | Combines the concepts of momentum and RMSProp, using exponentially decaying averages of both the first and second moments of gradients with bias correction. | $m_{new} = \beta_1 m_{old} + (1 - \beta_1) \nabla L(w_{old})$, $v_{new} = \beta_2 v_{old} + (1 - \beta_2) (\nabla L(w_{old}))^2$, $\hat{m}_{new} = \frac{m_{new}}{1 - \beta_1^{t}}$, $\hat{v}_{new} = \frac{v_{new}}{1 - \beta_2^{t}}$, $w_{new} = w_{old} - \frac{\eta}{\sqrt{\hat{v}_{new}} + \epsilon} \hat{m}_{new}$ | Adaptive Learning Rate, Momentum | Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) | Generally performs well and is robust to hyperparameter choices, fast convergence, handles sparse gradients. | Can sometimes converge to a suboptimal solution compared to SGD with fine-tuning. | Most deep learning tasks, a good default choice (see the illustrative sketch below the table). | Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. |
| Adamax | A variant of Adam that uses the infinity norm for the second moment, which can be more stable. | Similar to Adam, but $v_{new} = \max(\beta_2 v_{old}, \lvert \nabla L(w_{old}) \rvert)$, $w_{new} = w_{old} - \frac{\eta}{v_{new} + \epsilon} \hat{m}_{new}$ | Adaptive Learning Rate, Momentum | Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) | More stable with large parameter values compared to Adam. | Less commonly used than Adam, performance gains may be marginal. | Similar to Adam, as an alternative when Adam behaves unstably. | Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (Adamax is introduced in the same paper as Adam). |
| Nadam (Nesterov + Adam) | Combines Adam with the Nesterov accelerated gradient, incorporating the "lookahead" property. | Incorporates Nesterov momentum into the Adam update rules. | Adaptive Learning Rate, Momentum | Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) | Often provides slightly faster convergence and potentially better performance than Adam. | Slightly more complex formulation than Adam. | Similar to Adam, can be tried for potential improvements. | Dozat, T. (2016). Incorporating Nesterov momentum into Adam. ICLR 2016 Workshop Track. |
| AMSGrad | A modification of Adam that ensures the second moment estimate is non-decreasing to address potential convergence issues. | Uses $\hat{v}_{new} = \max(\hat{v}_{old}, v_{new})$ in the Adam update. | Adaptive Learning Rate, Momentum | Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) | Provides stronger theoretical convergence guarantees, can sometimes outperform Adam. | Empirical performance gains over Adam are not always significant and can be problem-dependent. | When stable convergence is critical, or when Adam shows unstable behavior. | Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237. |
| Lookahead | A wrapper that improves the convergence and generalization of inner optimizers by using "slow" and "fast" weights. | Periodically updates "slow" weights by interpolating towards "fast" weights optimized by an inner optimizer (e.g., Adam or SGD). | Wrapper, Meta-optimizer | Lookahead steps ($k$), Alpha ($\alpha$) (interpolation factor) | Can lead to faster convergence and potentially better generalization, can be applied on top of existing optimizers. | Adds complexity, requires tuning of additional hyperparameters. | Improving the performance of base optimizers. | Zhang, M. R., Lucas, J., Hinton, G., & Ba, J. (2019). Lookahead optimizer: k steps forward, 1 step back. Advances in Neural Information Processing Systems, 32. |
| RAdam (Rectified Adam) | Addresses Adam's poor performance in early training steps caused by the high variance of the adaptive learning rate. | Modifies the Adam update rule based on the variance of the adaptive learning rate, especially in early timesteps. | Adaptive Learning Rate, Momentum | Learning Rate ($\eta$), Decay Rates ($\beta_1, \beta_2$), Epsilon ($\epsilon$) | More stable and reliable training in the early stages, less sensitive to the initial learning rate choice. | Can be slightly more complex to implement than standard Adam. | Training deep networks from scratch, when Adam struggles initially. | Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., & Han, J. (2020). On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. |
| Fromage | A more recent technique that leverages functional analysis to determine update directions, aiming for improved robustness. | Uses functional norms and a layer-wise relative distance between networks to guide parameter updates (the complete formulation goes beyond a single update equation). | Functional Analysis-based | Learning Rate ($\eta$), various internal parameters | Aims for improved robustness and navigation of complex loss landscapes, theoretically grounded in functional analysis. | Newer technique, less widely adopted and tested than established optimizers, can be computationally intensive. | Research and experimentation, potentially for challenging optimization problems. | Bernstein, J., Vahdat, A., Yue, Y., & Liu, M.-Y. (2020). On the distance between two neural networks and the stability of learning. Advances in Neural Information Processing Systems, 33. |
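
To make these update rules concrete, below is a minimal NumPy sketch of three of them (SGD with Momentum, RMSProp, and Adam) applied to a toy least-squares problem. The toy objective, the function names (`sgd_momentum`, `rmsprop`, `adam`), and the hyperparameter values are illustrative assumptions for this cheat sheet, not a reference implementation.

```python
import numpy as np

# Toy least-squares objective (illustrative): L(w) = ||A w - b||^2, so grad L(w) = 2 A^T (A w - b).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad = lambda w: 2 * A.T @ (A @ w - b)

def sgd_momentum(w, steps=500, eta=0.01, gamma=0.9):
    # v_new = gamma * v_old + eta * grad(w_old); w_new = w_old - v_new
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + eta * grad(w)
        w = w - v
    return w

def rmsprop(w, steps=500, eta=0.01, gamma=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients (element-wise), as in the RMSProp row.
    Eg2 = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        Eg2 = gamma * Eg2 + (1 - gamma) * g**2
        w = w - eta * g / np.sqrt(Eg2 + eps)
    return w

def adam(w, steps=500, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # First/second moment estimates with bias correction, as in the Adam row.
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.zeros(5)
for optimizer in (sgd_momentum, rmsprop, adam):
    w = optimizer(w0.copy())
    print(optimizer.__name__, "final loss:", np.sum((A @ w - b) ** 2))
```

In practice these loops are handled by framework optimizers (e.g., `torch.optim.SGD` with `momentum`, `torch.optim.RMSprop`, `torch.optim.Adam`); the sketch only mirrors the equations in the table.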

Activation Functions

| Activation Function | Description and Why it was Applied | Mathematical Formulation | Type | Key Hyperparameters | Advantages | Disadvantages | Typical Use Cases / When to Use | Key Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear | The simplest activation. Output is directly proportional to the input. Used in early models, but it limits the network to linear mappings; non-linear functions were needed for complex problems. | $f(x) = x$ | Linear | None | Simple to understand and implement. | Cannot learn non-linear relationships; stacking layers provides no additional power beyond a single layer. | Output layers for regression problems. | N/A (basic mathematical function). |
| Sigmoid (Logistic) | One of the first popular non-linear activation functions. Squashes input values between 0 and 1. Used to introduce non-linearity and model probabilities. | $f(x) = \frac{1}{1 + e^{-x}}$ | Non-linear, Saturating | None | Introduces non-linearity, output is in a convenient range (0 to 1) for probabilities, differentiable. | Suffers from the vanishing gradient problem (gradients are very small for large positive or negative inputs), output is not zero-centered. | Output layers for binary classification. | N/A (historically used in early multilayer networks trained with backpropagation). |
| Tanh (Hyperbolic Tangent) | A rescaled version of the sigmoid function, squashing inputs between -1 and 1. Zero-centered, which often helps in training. | $f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Non-linear, Saturating | None | Zero-centered output (helps with training), stronger gradients than sigmoid in the central region. | Still suffers from the vanishing gradient problem. | Hidden layers, especially in recurrent neural networks (RNNs). | N/A (basic mathematical function; its use in ANNs is historical). |
| ReLU (Rectified Linear Unit) | Became popular due to its simplicity and effectiveness in combating the vanishing gradient problem. Outputs the input directly if positive, and zero otherwise. | $f(x) = \max(0, x)$ | Non-linear, Non-saturating (for positive inputs) | None | Avoids the vanishing gradient problem for positive inputs, computationally efficient, encourages sparsity. | Can suffer from the "dying ReLU" problem (neurons can become inactive and never recover), not zero-centered. | Hidden layers in most deep learning architectures (CNNs, feedforward networks); see the sketch below the table. | Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807-814. |
| Leaky ReLU | An attempt to address the dying ReLU problem by allowing a small non-zero gradient for negative inputs. | $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$ (where $\alpha$ is a small positive constant, e.g., 0.01) | Non-linear, Non-saturating | Alpha ($\alpha$) | Mitigates the dying ReLU problem, still computationally efficient. | Performance is not always consistent across different tasks, requires tuning of $\alpha$. | Hidden layers, as an alternative to ReLU. | Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. Proc. ICML, Vol. 30. |
| Parametric ReLU (PReLU) | Extends Leaky ReLU by making the slope for negative inputs a learnable parameter. | $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$ (where $\alpha$ is a learnable parameter) | Non-linear, Non-saturating | Alpha ($\alpha$) (learned) | Can potentially learn the optimal slope for negative inputs, performs well on some image datasets. | Can be prone to overfitting, requires more computational resources than ReLU or Leaky ReLU. | Hidden layers, particularly in convolutional networks. | He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision, 1026-1034. |
| Exponential Linear Unit (ELU) | Aims to make the mean activation close to zero, which can speed up learning. Has a smooth transition for negative inputs. | $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \le 0 \end{cases}$ (where $\alpha$ is a positive constant) | Non-linear, Non-saturating | Alpha ($\alpha$) | Can lead to faster convergence and better performance than ReLUs, avoids the dying ReLU problem, output mean is closer to zero. | More computationally expensive than ReLUs due to the exponential function, requires tuning of $\alpha$. | Hidden layers, as an alternative to ReLUs. | Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289. |
| Swish | A smooth, non-monotonic activation function discovered through automated search. It is the product of the input and the sigmoid of the input. | $f(x) = x \cdot \sigma(x)$ | Non-linear, Non-monotonic | None | Often performs better than ReLU on deeper models, smooth and differentiable everywhere. | More computationally expensive than ReLU. | Hidden layers, often used in state-of-the-art models. | Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941. |
| GELU (Gaussian Error Linear Unit) | A smooth activation function that models the neuron activation probabilistically. It is the input multiplied by the cumulative distribution function of the standard Gaussian distribution. | $f(x) = x \cdot P(X \le x)$ where $X \sim \mathcal{N}(0, 1)$, or approximately $f(x) = 0.5x(1 + \tanh(\sqrt{2/\pi}(x + 0.044715x^3)))$ | Non-linear, Non-monotonic | None | Currently used in many state-of-the-art models (e.g., Transformers), shows strong performance in various tasks. | More computationally expensive than ReLU. | Hidden layers in models like Transformers. | Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. |
| Mish | A smooth, non-monotonic activation function proposed as an alternative to ReLU and Swish. | $f(x) = x \cdot \tanh(\text{softplus}(x))$ where $\text{softplus}(x) = \ln(1 + e^x)$ | Non-linear, Non-monotonic | None | Smooth and non-monotonic, reported to perform better than Swish and ReLU on several datasets. | More computationally expensive than ReLU. | Hidden layers, exploring alternatives to common activations. | Misra, D. (2019). Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681. |
| SwiGLU | A gated activation function used in some recent Transformer architectures. It is a variant of the Gated Linear Unit (GLU) that uses Swish for gating. | $f(x, y) = x \odot \text{Swish}(y)$, where $x$ and $y$ are linear projections of the input and $\odot$ is the Hadamard product | Non-linear, Gated | None | Used in high-performing language models, effective in capturing complex interactions between features. | More complex than simpler activation functions. | Transformer networks. | Shazeer, N. (2020). GLU variants improve Transformer. arXiv preprint arXiv:2002.05202. |
| KAN (Kolmogorov-Arnold Networks) | A very recent approach that replaces fixed activation functions on nodes with learnable activation functions on the edges (synapses). Based on the Kolmogorov-Arnold representation theorem. | The network structure and learning involve approximating functions on edges using splines, rather than applying a fixed function at nodes. | Learnable Edge Activation | Number of grid points, spline order | Can potentially achieve higher accuracy with fewer parameters than MLPs with fixed activation functions, theoretically grounded in function approximation. | Computationally more expensive than standard MLPs, still an active area of research and development. | Research and experimentation, potential for improved model efficiency. | Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., & Tegmark, M. (2024). KAN: Kolmogorov-Arnold Networks. arXiv preprint arXiv:2404.19756. |
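
As a companion to the table, the following minimal NumPy sketch implements several of the activations exactly as their formulas appear above (ReLU, Leaky ReLU, ELU, Swish, the tanh approximation of GELU, and Mish). The function names and the sample input are illustrative assumptions, not part of any particular framework.

```python
import numpy as np

# Element-wise activation functions matching the formulas in the table above.
def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small fixed slope for negative inputs (it becomes learnable in PReLU).
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish: x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):
    # Tanh approximation of GELU from the table.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mish(x):
    # Mish: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-3, 3, 7)
for f in (relu, leaky_relu, elu, swish, gelu, mish):
    print(f.__name__, np.round(f(x), 3))
```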