# Advanced Optimizers
These optimizers are equipped with advanced techniques and methods from recent academic research papers.
Every feature should work independently, but using multiple features simultaneously is experimental, as they may interact in beneficial or detrimental ways.
- Added advanced variants of the Muon optimizer with features and settings from recent papers; refer to the Orthogonal Optimizers page.
- Added Cautious Weight Decay for all advanced optimizers; refer to the Cautious Weight Decay section.
| Optimizer | Description | Best For |
|---|---|---|
| Adam_Adv | Advanced Adam implementation | General purpose |
| Adopt_Adv | Adam variant with independent beta2 | Stable training in small batch size regimes |
| Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| Simplified_AdEMAMix | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
| Lion_Adv | Advanced Lion implementation | General purpose |
| Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |
These features work with all optimizers and are generally safe to enable.
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Fused Back Pass | Fuses backward pass; gradients are used immediately and memory is freed on-the-fly. | Memory-constrained environments | Reduces peak memory | memory optimization | All optimizers |
| Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16. | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers |
| OrthoGrad | Removes gradient component parallel to weights to reduce overfitting. | Full fine-tuning without weight decay | +33% time overhead for effective BS=4. Larger EBS = less overhead. | Grokking at Edge | All optimizers |
| Factored | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states. | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
| Cautious Weight Decay | Applies weight decay only when the update direction aligns with the parameter sign. | General purpose (requires Weight Decay > 0) | No overhead | Cautious Weight Decay | All optimizers |
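For intuition, here is a minimal PyTorch sketch of two of the features above, Stochastic Rounding and OrthoGrad. It is an illustrative approximation, not OneTrainer's exact implementation (in particular, whether OrthoGrad rescales the norm is a detail that may differ):

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    """Round FP32 -> BF16 stochastically instead of to-nearest.

    BF16 keeps only the top 16 bits of an FP32 value. Adding random noise to
    the 16 bits that get truncated makes rounding up/down probabilistic, so
    tiny updates survive on average instead of always rounding away.
    """
    bits = x.view(torch.int32)
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    return ((bits + noise) & -65536).view(torch.float32).to(torch.bfloat16)

def orthograd(param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Remove the gradient component parallel to the weights (OrthoGrad idea)."""
    w, g = param.flatten(), grad.flatten()
    proj = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)  # scalar projection coefficient
    g_orth = g - proj * w                               # keep only the orthogonal part
    g_orth = g_orth * (g.norm() / (g_orth.norm() + 1e-30))  # keep the original gradient norm
    return g_orth.view_as(grad)
```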
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Cautious | Only applies update if gradient direction aligns with momentum direction. | Accelerating convergence | No overhead | C-Optim | Adam/Adopt/Prodigy/Lion |
| Grams | Update direction derived purely from current gradient. | When Cautious is insufficient | No overhead | Grams | Adam/Adopt/Prodigy |
| AdEMAMix | Dual EMA system that retains relevance of gradients over tens of thousands of steps, accelerating convergence and reducing forgetting | Long training runs, especially where model forgetting is a concern | +1 state memory | AdEMAMix | Adam/Adopt/Prodigy |
| Simplified_AdEMAMix | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | Connections | Adam/Adopt/Prodigy |
| atan2 | Robust epsilon replacement with built-in update clipping | Stable, bounded updates (especially for Adopt, which needs it) | No overhead | Adam-atan2 | Adam/Adopt/Prodigy |
| Kourkoutas-β | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small-BS/high-LR training | Minimal overhead | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
- This tutorial is targeted at small batch size regimes (1–64), as this is the largest audience for OT.
- Large batch size trainings (≥512) should be tuned differently.
These two share the same core idea of modifying the momentum relative to the raw gradient:
- Cautious does it softly: it masks the update and waits for the momentum to align with the raw gradient's direction before applying it.
- Grams is more aggressive: it forces the momentum to adopt the sign of the gradient.
The Grams paper argues that it performs better, but you can use either one.
- If you enable both, Grams will be enabled and Cautious will be disabled.
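For intuition, here is a minimal PyTorch-style sketch of the two masks, assuming a generic Adam-style `exp_avg` momentum buffer (illustrative, not OneTrainer's exact code):

```python
import torch

def cautious_update(exp_avg: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Cautious: zero the update wherever momentum and gradient disagree in sign."""
    mask = (exp_avg * grad > 0).to(grad.dtype)
    # Rescale so the average update magnitude stays comparable after masking.
    mask = mask * (mask.numel() / (mask.sum() + 1))
    return exp_avg * mask

def grams_update(exp_avg: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Grams: keep the momentum's magnitude but force the current gradient's sign."""
    return exp_avg.abs() * grad.sign()
```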
This feature adds a second EMA to the optimizer, which decays very slowly and retains memory over tens of thousands of steps (depending on the beta3 value).
AdEMAMix deserves more attention for a few reasons:
- It’s proven - theoretically and empirically in academic research papers - that Adam’s first moment is nearly useless for small batch sizes. You can often use just the second moment and see no difference.
  Reference: see the section "Momentum and Batch Size" in the paper *AdaMeM: Memory Efficient Momentum for Adafactor*, which covers most of the papers making this claim.
- AdEMAMix’s EMA, on the other hand, is proven to be very beneficial for small batch sizes. You can skip Adam’s EMA entirely and use only AdEMAMix’s EMA in these scenarios.
| Parameter | Default | Tuning Guide |
|---|---|---|
| `beta3` | 0.9999 | Should be chosen based on training length: • Runs >120k steps: keep at 0.9999 • Runs ≤120k steps: try 0.999 • Note: I keep it at 0.9999 by default, but this guide follows recommendations from the AdEMAMix paper. |
| `alpha` | 5 | Multiplier for AdEMAMix's EMA contribution. • A value of 5 may be high; consider lowering the LR. • If training diverges, reduce to 2–3. • Increase to strengthen AdEMAMix's effect; decrease to weaken it. |
⚠️ Notice: AdEMAMix requires a 2–4× lower learning rate than Adam to remain stable, as its update steps are larger.
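To make the dual-EMA idea concrete, here is a simplified single-tensor sketch of an AdEMAMix-style step (bias correction, the alpha/beta3 schedulers, and weight decay are omitted; illustrative only, not OneTrainer's exact code):

```python
import torch

def ademamix_step(p, grad, m_fast, m_slow, v,
                  lr=1e-4, beta1=0.9, beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8):
    """One simplified AdEMAMix step: fast EMA + alpha * slow EMA over an Adam-style denominator."""
    m_fast.mul_(beta1).add_(grad, alpha=1 - beta1)        # short-term EMA (Adam's first moment)
    m_slow.mul_(beta3).add_(grad, alpha=1 - beta3)        # slow EMA, remembers very old gradients
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second moment
    update = (m_fast + alpha * m_slow) / (v.sqrt() + eps)
    p.add_(update, alpha=-lr)
```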
This is an interesting and unique method introduced in the Simplified_AdEMAMix paper (arXiv:2502.02431).
Since Adam’s EMA has been shown to benefit large batch sizes, and AdEMAMix’s slow EMA benefits small batch sizes, why not align both approaches?
That’s exactly what the paper proposes: it replaces Adam’s first moment (EMA) with a gradient accumulator, capturing the strengths of both methods. The paper demonstrates - theoretically and empirically - that this single modified accumulator can match or even exceed the efficiency of both Adam and AdEMAMix when properly tuned.
📌 Key Insight from the Paper:
Classical momentum methods (e.g., Adam’s EMA) do not accelerate the optimizer in noisy regimes (small BS).
The accumulator + `Grad α` design in Simplified_AdEMAMix directly addresses this (see the update sketch after the tuning notes below):
- The accumulator accelerates the optimizer by accumulating gradients across steps.
- `Grad α` scales the raw gradient, allowing controlled emphasis on recent updates, which is ideal for cutting through noise in small-batch training.
Due to its fundamental modification of the first moment:
- It is incompatible with features relying on standard momentum (e.g., Grams, Cautious).
- It is incompatible with `atan2` or any standard gradient clipping, due to its inherently large update steps.
- It typically requires a ~100x lower learning rate (also related to the `Grad α` parameter explained below).
| Parameter | Default | Tuning Guide |
|---|---|---|
| `beta1` | 0.99 | Now controls the effective memory length of the accumulator (previously the EMA decay). • 0.9 → ~10-step memory → too short for small batches. • For small batch sizes, use 0.99 to 0.9999, depending on training length and desired stability. |
| `Grad α` (Smoothing Factor) | 100 | Most critical parameter to tune. Controls the weight of the raw, current gradient in the final update. • Should scale inversely with batch size (i.e., reduce it as batch size increases). • High values (10–100) emphasize recent gradients, ideal for small batches to cut through noise and adapt quickly. |
- You must set these hyperparameters when enabling `Simplified_AdEMAMix` with `Prodigy_Adv` or `Adopt_Adv`.
- Since `Simplified_AdEMAMix` requires a ~100x smaller learning rate, set `initial d` in `Prodigy_Adv` to:
  - `1e-8` for LoRA
  - `1e-10` for full fine-tuning
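Putting the pieces above together, here is a minimal sketch of the accumulator-based update under the assumptions stated in this section (no bias correction, no Prodigy `d` scaling; `grad_alpha` stands for the `Grad α` setting; illustrative only):

```python
import torch

def simplified_ademamix_step(p, grad, acc, v,
                             lr=1e-6, beta1=0.99, beta2=0.999, grad_alpha=100.0, eps=1e-8):
    """One simplified step: decaying gradient accumulator plus an amplified raw gradient."""
    acc.mul_(beta1).add_(grad)                            # accumulator replaces Adam's first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second moment, as in Adam
    update = (grad_alpha * grad + acc) / (v.sqrt() + eps)
    # The large magnitude of (grad_alpha * grad + acc) is why a ~100x smaller LR is needed.
    p.add_(update, alpha=-lr)
```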
A simple yet effective method for Adam-based optimizers that improves upon the eps hyperparameter, completely replacing it and making the optimizer scale-invariant.
It bounds the update step within the range [-2, 2], effectively acting as built-in update clipping to prevent large, destabilizing jumps during optimization.
✅ Highly Recommended for Unstable Optimizers:
This is especially beneficial for optimizers like `Adopt_Adv`, which typically require high `eps` values and update clipping to remain stable during training.
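A minimal sketch of the idea follows; the constants are the values suggested in the Adam-atan2 paper and are an assumption here, not necessarily OneTrainer's exact settings:

```python
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                 a: float = 1.2732395, b: float = 1.0) -> torch.Tensor:
    """Scale-invariant replacement for m_hat / (sqrt(v_hat) + eps).

    atan2 handles a zero denominator gracefully (no eps needed), and because
    |atan2(x, y)| <= pi/2 when y >= 0, each element of the update is bounded
    by a * pi/2 ≈ 2, which acts as built-in update clipping.
    """
    return a * torch.atan2(m_hat, b * v_hat.sqrt())
```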
Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:
- During gradient bursts → β₂ ↓ toward `Lower β₂` → faster reaction
- During calm phases → β₂ ↑ toward `The Selected β₂` → stronger smoothing
This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
| Category | Details |
|---|---|
| ✅ Pros | • Layer-wise adaptation blends benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). • Robust to sudden loss landscape shifts, reacts quickly during gradient bursts, smooths during calm phases. • High tolerance to aggressive learning rates. |
| ⚠️ Cons | • Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using K-β Warmup Steps. |
💡 Best Practice: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
- Enable `Kourkoutas Beta`
- Set `beta2 = 0.999` and let it adapt.
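For intuition, here is a heavily simplified sketch of per-layer β₂ adaptation. The exact sunspike definition below (current gradient norm relative to an EMA of norms, squashed into [0, 1)) is an assumption for illustration; the real formulation follows the Kourkoutas-β paper:

```python
import torch

def kourkoutas_beta2(grad: torch.Tensor, norm_ema: torch.Tensor,
                     beta2_max: float = 0.999, beta2_low: float = 0.88,
                     ema_decay: float = 0.99, eps: float = 1e-8) -> float:
    """Illustrative layer-wise beta2 driven by a bounded 'sunspike' ratio."""
    g_norm = grad.norm()
    norm_ema.mul_(ema_decay).add_(g_norm, alpha=1 - ema_decay)  # slow tracker of typical norms
    r = g_norm / (norm_ema + eps)
    sunspike = (r / (1.0 + r)).item()   # bounded in [0, 1)
    # Gradient burst (r >> 1): sunspike -> 1, beta2 drops toward beta2_low (fast reaction).
    # Calm phase    (r << 1): sunspike -> 0, beta2 stays near beta2_max (strong smoothing).
    return beta2_max - (beta2_max - beta2_low) * sunspike
```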
This is a new technique that refines how Weight Decay is applied.
In standard optimization (like AdamW or Lion), Weight Decay always pulls parameters toward zero, regardless of what the loss function wants. This often creates a conflict where the optimizer wants to increase a value, but weight decay fights it by trying to decrease it.
CWD solves this by applying a simple logic mask:
- If the optimizer update and the parameter sign align (agree): Apply weight decay to help shrink the value.
- If they disagree (conflict): Turn off weight decay for that specific parameter and let the optimizer do its job.
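A minimal sketch of the mask, following the description above (decoupled, AdamW-style decay assumed; the exact sign convention for "update" may differ in the actual implementation):

```python
import torch

def cautious_weight_decay_(p: torch.Tensor, update: torch.Tensor,
                           lr: float, weight_decay: float) -> None:
    """Apply weight decay only where the optimizer update agrees with the parameter sign."""
    mask = (update.sign() == p.sign()).to(p.dtype)   # 1 where decay does not fight the update
    p.mul_(1 - lr * weight_decay * mask)             # masked, decoupled weight decay
    p.add_(update, alpha=-lr)                        # then the usual optimizer update
```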
Do not confuse Cautious with Cautious Weight Decay:
- Cautious (C-Optim): Masks the optimizer update based on momentum alignment.
- Cautious Weight Decay (CWD): Masks the weight decay based on parameter alignment.
- Can you use both? Yes. They operate on different parts of the step function.
- Simply enable it and set your `Weight Decay` to a value > 0.
- No tuning required: it uses the exact same `Weight Decay` value you would normally use.
- Performance: the paper demonstrates consistently lower final loss and better generalization on both Vision and LLM tasks.
- When `Factored` is enabled (True), it will factor all optimizer states, regardless of whether they belong to AdEMAMix's slow EMA, Simplified_AdEMAMix's accumulator, or Lion's momentum.
- Ranking of optimizers by tolerance to `Factored` 1-bit factorization:
  - Lion: most tolerant; produced identical results in my tests.
  - Simplified_AdEMAMix: next in line; due to its large updates based on raw, uncompressed gradients, it should yield identical or very comparable results.
  - Adam / Prodigy / Adopt: deliver comparable results and are still worth using for the memory savings (16x–32x smaller optimizer states).
- Note for `PRODIGY_Adv` hyperparameters:
  - `beta3` = beta3 for Prodigy's own calculations; you can leave it empty, it won't hurt.
  - `beta3 EMA` = beta3 for AdEMAMix's EMA, which you tune depending on your run length. It will only take effect when `AdEMAMix EMA` is true.
- When the `Simplified_AdEMAMix` option is used with `Prodigy_Adv`, the `d_coef` is scaled to `d_coef / alpha`. This effectively tells Prodigy, "Whatever learning rate you find, make it `alpha` times smaller," which properly aligns the two optimizers (tested and validated in Finetune/LoRA/Embedding). You can also use `d_limiter` instead, but some users reported bad results with it.

  ⚠️ Manual tuning is still required if `beta1` changes: as `beta1` increases, the momentum becomes "older" and the updates become more aggressive. Compensate by reducing `d_coef`:

  | beta1 | Recommended d_coef |
  |---|---|
  | 0.99 | 1.0 (default) |
  | 0.995 | 0.5 |
  | 0.999 | 0.1 |
  | 0.9999 | 0.01 |

  📌 Why? Higher `beta1` extends the memory of the accumulator, causing larger cumulative updates, which requires further LR dampening via `d_coef`.

  ⚠️ Notice: lower `d_coef` values require more steps to find the LR, but the result should be more stable and safe.
- Lion is a sign-based optimizer that produces large updates per step. This can destabilize Prodigy's calculations, leading to an overestimated LR. Newer versions add a `d_limiter` option for it; this is `True` by default and should remain `True`, as it forces Prodigy to behave well with Lion, resulting in a suitable LR.
- For the optimizers `Adam_Adv`, `ADOPT_Adv`, and `Prodigy_Adv`: when `beta1` is set to `0` (or left unset/None), the optimizer will skip the first moment (standard EMA) entirely.
  ✅ This is fully compatible with the `use_AdEMAMix` option, allowing you to use only AdEMAMix's slow EMA while bypassing Adam's traditional momentum. Ideal for small-batch training where the standard EMA adds little to no value.
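As a sketch of what "skipping the first moment" means inside an Adam-style step (illustrative only; the function and variable names are assumptions, not OneTrainer's API):

```python
import torch

def update_numerator(grad, exp_avg, slow_ema, beta1, beta3_ema, alpha, use_ademamix):
    """Numerator of an Adam-style update when beta1 may be 0/None (illustrative)."""
    if beta1:  # beta1 is neither None nor 0
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)   # standard first moment
        numerator = exp_avg.clone()
    else:
        numerator = grad.clone()                          # skip the first moment entirely
    if use_ademamix:
        slow_ema.mul_(beta3_ema).add_(grad, alpha=1 - beta3_ema)
        numerator = numerator + alpha * slow_ema          # only the slow EMA carries momentum
    return numerator
```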
- Lion is the optimizer most closely related to `AdEMAMix`/`Simplified_AdEMAMix`, due to its typical setting of `beta1 < beta2` (e.g., 0.9 and 0.99). This gives roughly 10x more weight to the raw gradient compared to historical momentum, behaving similarly to `Simplified_AdEMAMix` with `alpha=10` but using a sign operation instead of adaptive learning rate scaling.
- Update rule differences across optimizers:
  - AdEMAMix: `update = fast_EMA + (α * slow_EMA)` → blends a short-term EMA with an amplified long-term EMA.
  - Simplified_AdEMAMix: `update = (α * grad) + accumulator` → directly scales the current gradient and adds it to a decaying accumulator.
  - Adam: `update = first_moment_EMA` → the standard exponential moving average of gradients.
- Note for skippable hyperparameters:
  - When `AdEMAMix EMA` is `False`, the `beta3 EMA` and `alpha` parameters will be skipped and have no effect.
  - When `Simplified_AdEMAMix` is `False`, the `grad α` parameter will be skipped and have no effect.

TODO