# Advanced Optimizers
These optimizers are equipped with advanced techniques and methods from recent academic research papers.
Every feature should work independently, but using multiple features simultaneously is experimental, as they may interact in beneficial or detrimental ways.
- Added advanced variants of the Muon optimizer with features and settings from recent papers; refer to the Orthogonal Optimizers page.
- Added Cautious Weight Decay for all advanced optimizers; refer to the Cautious Weight Decay section.
| Optimizer | Description | Best For |
|---|---|---|
| Adam_Adv | Advanced Adam implementation | General purpose |
| Adopt_Adv | Adam variant with independent beta2 | Stable training in small batch size regimes |
| Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| Simplified_AdEMAMix | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
| Lion_Adv | Advanced Lion implementation | General purpose |
| Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |
These features work with all optimizers and are generally safe to enable.
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Fused Back Pass | Fuses backward pass; gradients are used immediately and memory is freed on-the-fly. | Memory-constrained environments | Reduces peak memory | memory optimization | All optimizers |
| Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16. | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers |
| OrthoGrad | Removes gradient component parallel to weights to reduce overfitting. | Full fine-tuning without weight decay | +33% time overhead for effective BS=4. Larger EBS = less overhead. | Grokking at Edge | All optimizers |
| Factored | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states. | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
| Cautious Weight Decay | Applies weight decay only when the update direction aligns with the parameter sign. | General purpose (requires Weight Decay > 0) | No overhead | Cautious Weight Decay | All optimizers |
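For intuition, here is a minimal PyTorch sketch of two of the features above, Stochastic Rounding and OrthoGrad. It is an illustrative approximation, not OneTrainer's exact implementation (in particular, whether OrthoGrad rescales the norm is a detail that may differ):

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    """Round FP32 -> BF16 stochastically instead of to-nearest.

    BF16 keeps only the top 16 bits of an FP32 value. Adding random noise to
    the 16 bits that get truncated makes rounding up/down probabilistic, so
    tiny updates survive on average instead of always rounding away.
    """
    bits = x.view(torch.int32)
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    return ((bits + noise) & -65536).view(torch.float32).to(torch.bfloat16)

def orthograd(param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Remove the gradient component parallel to the weights (OrthoGrad idea)."""
    w, g = param.flatten(), grad.flatten()
    proj = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)  # scalar projection coefficient
    g_orth = g - proj * w                               # keep only the orthogonal part
    g_orth = g_orth * (g.norm() / (g_orth.norm() + 1e-30))  # keep the original gradient norm
    return g_orth.view_as(grad)
```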
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Cautious | Only applies update if gradient direction aligns with momentum direction. | Accelerating convergence | No overhead | C-Optim | Adam/Adopt/Prodigy/Lion |
| Grams | Update direction derived purely from current gradient. | When Cautious is insufficient | No overhead | Grams | Adam/Adopt/Prodigy |
| AdEMAMix | Dual EMA system that retains relevance of gradients over tens of thousands of steps, accelerating convergence and reducing forgetting | Long training runs, especially where model forgetting is a concern | +1 state memory | AdEMAMix | Adam/Adopt/Prodigy |
| Simplified_AdEMAMix | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | Connections | Adam/Adopt/Prodigy |
| atan2 | Robust epsilon replacement with built-in update clipping | Stable, bounded updates (especially for Adopt, which needs it) | No overhead | Adam-atan2 | Adam/Adopt/Prodigy |
| Kourkoutas-β | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small-BS/high-LR training | Minimal overhead | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
- This tutorial is targeted at small batch size regimes (1–64), as this is the largest audience for OT.
- Large batch size trainings (≥512) should be tuned differently.
These two share the same core idea of modifying the momentum relative to the raw gradient:
- Cautious does it softly: it masks the update and waits for the momentum to align with the raw gradient's direction before applying it.
- Grams is more aggressive: it forces the momentum to adopt the sign of the gradient.
The Grams paper argues that it performs better, but you can use either one.
- If you enable both, Grams will be enabled and Cautious will be disabled.
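For intuition, here is a minimal PyTorch-style sketch of the two masks, assuming a generic Adam-style `exp_avg` momentum buffer (illustrative, not OneTrainer's exact code):

```python
import torch

def cautious_update(exp_avg: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Cautious: zero the update wherever momentum and gradient disagree in sign."""
    mask = (exp_avg * grad > 0).to(grad.dtype)
    # Rescale so the average update magnitude stays comparable after masking.
    mask = mask * (mask.numel() / (mask.sum() + 1))
    return exp_avg * mask

def grams_update(exp_avg: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Grams: keep the momentum's magnitude but force the current gradient's sign."""
    return exp_avg.abs() * grad.sign()
```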
This feature adds a second EMA to the optimizer, which decays very slowly and retains memory over tens of thousands of steps (depending on the beta3 value).
AdEMAMix deserves more attention for a few reasons:
- It’s proven - theoretically and empirically in academic research papers - that Adam’s first moment is nearly useless for small batch sizes. You can often use just the second moment and see no difference.
  Reference: see the section "Momentum and Batch Size" in the paper *AdaMeM: Memory Efficient Momentum for Adafactor*, which covers most of the papers making this claim.
- AdEMAMix’s EMA, on the other hand, is proven to be very beneficial for small batch sizes. You can skip Adam’s EMA entirely and use only AdEMAMix’s EMA in these scenarios.
| Parameter | Default | Tuning Guide |
|---|---|---|
| `beta3` | 0.9999 | Should be chosen based on training length: • Runs >120k steps: keep at 0.9999 • Runs ≤120k steps: try 0.999 • Note: I keep it at 0.9999 by default, but this guide follows recommendations from the AdEMAMix paper. |
| `alpha` | 5 | Multiplier for AdEMAMix's EMA contribution. • A value of 5 may be high; consider lowering the LR. • If training diverges, reduce to 2–3. • Increase to strengthen AdEMAMix's effect; decrease to weaken it. |
⚠️ Notice: AdEMAMix requires a 2–4× lower learning rate than Adam to remain stable, as its update steps are larger.
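To make the dual-EMA idea concrete, here is a simplified single-tensor sketch of an AdEMAMix-style step (bias correction, the alpha/beta3 schedulers, and weight decay are omitted; illustrative only, not OneTrainer's exact code):

```python
import torch

def ademamix_step(p, grad, m_fast, m_slow, v,
                  lr=1e-4, beta1=0.9, beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8):
    """One simplified AdEMAMix step: fast EMA + alpha * slow EMA over an Adam-style denominator."""
    m_fast.mul_(beta1).add_(grad, alpha=1 - beta1)        # short-term EMA (Adam's first moment)
    m_slow.mul_(beta3).add_(grad, alpha=1 - beta3)        # slow EMA, remembers very old gradients
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second moment
    update = (m_fast + alpha * m_slow) / (v.sqrt() + eps)
    p.add_(update, alpha=-lr)
```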
This is an interesting and unique method introduced in the Simplified_AdEMAMix paper (arXiv:2502.02431).
Since Adam’s EMA has been shown to benefit large batch sizes, and AdEMAMix’s slow EMA benefits small batch sizes, why not align both approaches?
That’s exactly what the paper proposes: it replaces Adam’s first moment (EMA) with a gradient accumulator, capturing the strengths of both methods. The paper demonstrates - theoretically and empirically - that this single modified accumulator can match or even exceed the efficiency of both Adam and AdEMAMix when properly tuned.
📌 Key Insight from the Paper:
Classical momentum methods (e.g., Adam’s EMA) do not accelerate the optimizer in noisy regimes (small BS).
The accumulator + `Grad α` design in Simplified_AdEMAMix directly addresses this (see the update sketch after the tuning notes below):
- The accumulator accelerates the optimizer by accumulating gradients across steps.
- `Grad α` scales the raw gradient, allowing controlled emphasis on recent updates, which is ideal for cutting through noise in small-batch training.
Due to its fundamental modification of the first moment:
- It is incompatible with features relying on standard momentum (e.g., Grams, Cautious).
- It is incompatible with `atan2` or any standard gradient clipping, due to its inherently large update steps.
- It typically requires a ~100x lower learning rate (also related to the `Grad α` parameter explained below).
| Parameter | Default | Tuning Guide |
|---|---|---|
| `beta1` | 0.99 | Now controls the effective memory length of the accumulator (previously the EMA decay). • 0.9 → ~10-step memory → too short for small batches. • For small batch sizes, use 0.99 to 0.9999, depending on training length and desired stability. |
| `Grad α` (Smoothing Factor) | 100 | Most critical parameter to tune. Controls the weight of the raw, current gradient in the final update. • Should scale inversely with batch size (i.e., reduce it as batch size increases). • High values (10–100) emphasize recent gradients, ideal for small batches to cut through noise and adapt quickly. |
- You must set these hyperparameters when enabling `Simplified_AdEMAMix` with `Prodigy_Adv` or `Adopt_Adv`.
- Since `Simplified_AdEMAMix` requires a ~100x smaller learning rate, set `initial d` in `Prodigy_Adv` to:
  - `1e-8` for LoRA
  - `1e-10` for full fine-tuning
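Putting the pieces above together, here is a minimal sketch of the accumulator-based update under the assumptions stated in this section (no bias correction, no Prodigy `d` scaling; `grad_alpha` stands for the `Grad α` setting; illustrative only):

```python
import torch

def simplified_ademamix_step(p, grad, acc, v,
                             lr=1e-6, beta1=0.99, beta2=0.999, grad_alpha=100.0, eps=1e-8):
    """One simplified step: decaying gradient accumulator plus an amplified raw gradient."""
    acc.mul_(beta1).add_(grad)                            # accumulator replaces Adam's first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second moment, as in Adam
    update = (grad_alpha * grad + acc) / (v.sqrt() + eps)
    # The large magnitude of (grad_alpha * grad + acc) is why a ~100x smaller LR is needed.
    p.add_(update, alpha=-lr)
```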
A simple yet effective method for Adam-based optimizers that improves upon the eps hyperparameter, completely replacing it and making the optimizer scale-invariant.
It bounds the update step within the range [-2, 2], effectively acting as built-in update clipping to prevent large, destabilizing jumps during optimization.
✅ Highly Recommended for Unstable Optimizers:
This is especially beneficial for optimizers like `Adopt_Adv`, which typically require high `eps` values and update clipping to remain stable during training.
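A minimal sketch of the idea follows; the constants are the values suggested in the Adam-atan2 paper and are an assumption here, not necessarily OneTrainer's exact settings:

```python
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                 a: float = 1.2732395, b: float = 1.0) -> torch.Tensor:
    """Scale-invariant replacement for m_hat / (sqrt(v_hat) + eps).

    atan2 handles a zero denominator gracefully (no eps needed), and because
    |atan2(x, y)| <= pi/2 when y >= 0, each element of the update is bounded
    by a * pi/2 ≈ 2, which acts as built-in update clipping.
    """
    return a * torch.atan2(m_hat, b * v_hat.sqrt())
```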
Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:
- During gradient bursts → β₂ ↓ toward `Lower β₂` → faster reaction
- During calm phases → β₂ ↑ toward `The Selected β₂` → stronger smoothing
This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
| Category | Details |
|---|---|
| ✅ Pros | • Layer-wise adaptation blends benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). • Robust to sudden loss landscape shifts, reacts quickly during gradient bursts, smooths during calm phases. • High tolerance to aggressive learning rates. |
| ⚠️ Cons | • Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using K-β Warmup Steps. |
💡 Best Practice: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
- Enable `Kourkoutas Beta`
- Set `beta2 = 0.999` and let it adapt.
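For intuition, here is a heavily simplified sketch of per-layer β₂ adaptation. The exact sunspike definition below (current gradient norm relative to an EMA of norms, squashed into [0, 1)) is an assumption for illustration; the real formulation follows the Kourkoutas-β paper:

```python
import torch

def kourkoutas_beta2(grad: torch.Tensor, norm_ema: torch.Tensor,
                     beta2_max: float = 0.999, beta2_low: float = 0.88,
                     ema_decay: float = 0.99, eps: float = 1e-8) -> float:
    """Illustrative layer-wise beta2 driven by a bounded 'sunspike' ratio."""
    g_norm = grad.norm()
    norm_ema.mul_(ema_decay).add_(g_norm, alpha=1 - ema_decay)  # slow tracker of typical norms
    r = g_norm / (norm_ema + eps)
    sunspike = (r / (1.0 + r)).item()   # bounded in [0, 1)
    # Gradient burst (r >> 1): sunspike -> 1, beta2 drops toward beta2_low (fast reaction).
    # Calm phase    (r << 1): sunspike -> 0, beta2 stays near beta2_max (strong smoothing).
    return beta2_max - (beta2_max - beta2_low) * sunspike
```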
This is a new technique that refines how Weight Decay is applied.
In standard optimization (like AdamW or Lion), Weight Decay always pulls parameters toward zero, regardless of what the loss function wants. This often creates a conflict where the optimizer wants to increase a value, but weight decay fights it by trying to decrease it.
CWD solves this by applying a simple logic mask:
- If the optimizer update and the parameter sign align (agree): Apply weight decay to help shrink the value.
- If they disagree (conflict): Turn off weight decay for that specific parameter and let the optimizer do its job.
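A minimal sketch of the mask, following the description above (decoupled, AdamW-style decay assumed; the exact sign convention for "update" may differ in the actual implementation):

```python
import torch

def cautious_weight_decay_(p: torch.Tensor, update: torch.Tensor,
                           lr: float, weight_decay: float) -> None:
    """Apply weight decay only where the optimizer update agrees with the parameter sign."""
    mask = (update.sign() == p.sign()).to(p.dtype)   # 1 where decay does not fight the update
    p.mul_(1 - lr * weight_decay * mask)             # masked, decoupled weight decay
    p.add_(update, alpha=-lr)                        # then the usual optimizer update
```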
Do not confuse Cautious with Cautious Weight Decay:
- Cautious (C-Optim): Masks the optimizer update based on momentum alignment.
- Cautious Weight Decay (CWD): Masks the weight decay based on parameter alignment.
- Can you use both? Yes. They operate on different parts of the step function.
- Simply enable it and set your `Weight Decay` to a value > 0.
- No tuning required: it uses the exact same `Weight Decay` value you would normally use.
- Performance: the paper demonstrates consistently lower final loss and better generalization on both Vision and LLM tasks.
- When `Factored` is enabled (True), it will factor all optimizer states, regardless of whether they belong to AdEMAMix's slow EMA, Simplified_AdEMAMix's accumulator, or Lion's momentum.
- Ranking of optimizers by tolerance to `Factored` 1-bit factorization:
  - Lion: most tolerant; produced identical results in my tests.
  - Simplified_AdEMAMix: next in line; due to its large updates based on raw, uncompressed gradients, it should yield identical or very comparable results.
  - Adam / Prodigy / Adopt: deliver comparable results and are still worth using for the memory savings (16x–32x smaller optimizer states).
- Note for `PRODIGY_Adv` hyperparameters:
  - `beta3` = beta3 for Prodigy's own calculations; you can leave it empty, it won't hurt.
  - `beta3 EMA` = beta3 for AdEMAMix's EMA, which you tune depending on your run length. It will only take effect when `AdEMAMix EMA` is true.
- When the `Simplified_AdEMAMix` option is used with `Prodigy_Adv`, the `d_coef` is scaled to `d_coef / alpha`. This effectively tells Prodigy, "Whatever learning rate you find, make it `alpha` times smaller," which properly aligns the two optimizers (tested and validated in Finetune/LoRA/Embedding). You can also use `d_limiter` instead, but some users reported bad results with it.

  ⚠️ Manual tuning is still required if `beta1` changes: as `beta1` increases, the momentum becomes "older" and the updates become more aggressive. Compensate by reducing `d_coef`:

  | beta1 | Recommended d_coef |
  |---|---|
  | 0.99 | 1.0 (default) |
  | 0.995 | 0.5 |
  | 0.999 | 0.1 |
  | 0.9999 | 0.01 |

  📌 Why? Higher `beta1` extends the memory of the accumulator, causing larger cumulative updates, which requires further LR dampening via `d_coef`.

  ⚠️ Notice: lower `d_coef` values require more steps to find the LR, but the result should be more stable and safe.
- Lion is a sign-based optimizer that produces large updates per step. This can destabilize Prodigy's calculations, leading to an overestimated LR. Newer versions add a `d_limiter` option for it; this is `True` by default and should remain `True`, as it forces Prodigy to behave well with Lion, resulting in a suitable LR.
- For the optimizers `Adam_Adv`, `ADOPT_Adv`, and `Prodigy_Adv`: when `beta1` is set to `0` (or left unset/None), the optimizer will skip the first moment (standard EMA) entirely.
  ✅ This is fully compatible with the `use_AdEMAMix` option, allowing you to use only AdEMAMix's slow EMA while bypassing Adam's traditional momentum. Ideal for small-batch training where the standard EMA adds little to no value.
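As a sketch of what "skipping the first moment" means inside an Adam-style step (illustrative only; the function and variable names are assumptions, not OneTrainer's API):

```python
import torch

def update_numerator(grad, exp_avg, slow_ema, beta1, beta3_ema, alpha, use_ademamix):
    """Numerator of an Adam-style update when beta1 may be 0/None (illustrative)."""
    if beta1:  # beta1 is neither None nor 0
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)   # standard first moment
        numerator = exp_avg.clone()
    else:
        numerator = grad.clone()                          # skip the first moment entirely
    if use_ademamix:
        slow_ema.mul_(beta3_ema).add_(grad, alpha=1 - beta3_ema)
        numerator = numerator + alpha * slow_ema          # only the slow EMA carries momentum
    return numerator
```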
- Lion is the optimizer most closely related to `AdEMAMix`/`Simplified_AdEMAMix`, due to its typical setting of `beta1 < beta2` (e.g., 0.9 and 0.99). This gives roughly 10x more weight to the raw gradient compared to historical momentum, behaving similarly to `Simplified_AdEMAMix` with `alpha=10` but using a sign operation instead of adaptive learning rate scaling.
- Update rule differences across optimizers:
  - AdEMAMix: `update = fast_EMA + (α * slow_EMA)` → blends a short-term EMA with an amplified long-term EMA.
  - Simplified_AdEMAMix: `update = (α * grad) + accumulator` → directly scales the current gradient and adds it to a decaying accumulator.
  - Adam: `update = first_moment_EMA` → the standard exponential moving average of gradients.
- Note for skippable hyperparameters:
  - When `AdEMAMix EMA` is `False`, the `beta3 EMA` and `alpha` parameters will be skipped and have no effect.
  - When `Simplified_AdEMAMix` is `False`, the `grad α` parameter will be skipped and have no effect.

TODO