Advanced Optimizers - Nerogar/OneTrainer GitHub Wiki
These are optimizers equipped with the newest, advanced techniques and methods from academic research papers.
Every feature should work independently, but using multiple features simultaneously is experimental, as they may interact in beneficial or detrimental ways.
Optimizer | Description | Best For |
---|---|---|
Adam_Adv | Advanced Adam implementation | General purpose |
Adopt_Adv | Adam variant with independent beta2 | Stable training in small batch size regimes |
Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
Simplified_AdEMAMix | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
Lion_Adv | Advanced Lion implementation | General purpose |
Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |
These features work with all optimizers and are generally safe to enable.
Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
---|---|---|---|---|---|
Fused Back Pass | Fuses backward pass; gradients are used immediately and memory is freed on-the-fly. | Memory-constrained environments | Reduces peak memory | memory optimization | All optimizers |
Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16. | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers |
OrthoGrad | Removes the gradient component parallel to the weights to reduce overfitting. | Full fine-tuning without weight decay | +33% time overhead at effective batch size 4; larger effective batch sizes incur less overhead. | Grokking at the Edge of Numerical Stability | All optimizers |
Factored | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states. | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
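For intuition, the OrthoGrad projection from the table above can be sketched roughly as follows. This is a minimal sketch, not OneTrainer's exact code; the final rescaling to the original gradient norm is an assumption based on the referenced paper.

```python
import torch

def orthograd(weight: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    # Remove the component of the gradient that is parallel to the weights,
    # so the update changes the weights' direction rather than their scale.
    w, g = weight.flatten(), grad.flatten()
    g_orth = g - (torch.dot(w, g) / (torch.dot(w, w) + eps)) * w
    # Rescale to the original gradient norm so the step size stays comparable (assumed).
    g_orth = g_orth * (g.norm() / (g_orth.norm() + eps))
    return g_orth.view_as(grad)
```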
Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
---|---|---|---|---|---|
Cautious | Only applies update if gradient direction aligns with momentum direction. | Accelerating convergence | No overhead | C-Optim | Adam/Adopt/Prodigy/Lion |
Grams | Update direction derived purely from current gradient. | When Cautious is insufficient | No overhead | Grams | Adam/Adopt/Prodigy |
AdEMAMix | Dual EMA system that retains relevance of gradients over tens of thousands of steps, accelerating convergence and reducing forgetting | Long training runs, especially where model forgetting is a concern | +1 state memory | AdEMAMix | Adam/Adopt/Prodigy |
Simplified_AdEMAMix | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | Connections | Adam/Adopt/Prodigy |
atan2 | Robust epsilon replacement with built-in gradient clipping | Stable, bounded updates (Adopt in particular needs this) | No overhead | Adam-atan2 | Adam/Adopt/Prodigy |
Kourkoutas-β | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small-BS/high-LR training | Minimal overhead | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
- This tutorial is targeted at small batch size regimes (1–64), as this is the largest audience for OT.
- Large batch size trainings (≥512) should be tuned differently.
These two share the same core idea of modifying the momentum relative to the raw gradient:
- Cautious does it softly, masking the update and waiting for the momentum to align with the direction of the raw gradients before applying it.
- Grams is more aggressive: it forces the momentum to adopt the sign of the gradients.

The Grams paper argues that it is the better of the two, but you can use either one.
- If you enable both, Grams will be enabled and Cautious will be disabled.
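As a rough illustration of the difference, here is a minimal sketch of both ideas applied to a per-parameter update tensor (not OneTrainer's exact implementation; the mask rescaling in Cautious follows the C-Optim paper):

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Mask out update components whose sign disagrees with the raw gradient,
    # then rescale so the overall update magnitude stays comparable.
    mask = (update * grad > 0).to(update.dtype)
    mask = mask * (mask.numel() / (mask.sum() + 1))
    return update * mask

def grams(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Keep the update magnitude but force every component to take the gradient's sign.
    return update.abs() * grad.sign()
```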
This feature adds a second EMA to the optimizer, which decays very slowly and retains memory over tens of thousands of steps (depending on the `beta3` value).
AdEMAMix deserves more attention for a few reasons:
- It’s proven - theoretically and empirically in academic research papers - that Adam’s first moment is nearly useless for small batch sizes. You can often use just the second moment and see no difference.

Reference: see the section "Momentum and Batch Size" in the paper AdaMeM: Memory Efficient Momentum for Adafactor, which covers most of the papers that state this.
- AdEMAMix’s EMA, on the other hand, is proven to be very beneficial for small batch sizes. You can skip Adam’s EMA entirely and use only AdEMAMix’s EMA in these scenarios.
Parameter | Default | Tuning Guide |
---|---|---|
beta3 | 0.9999 | Should be chosen based on training length: • Runs >120k steps: keep at 0.9999 • Runs ≤120k steps: try 0.999. Note: I keep it at 0.9999 by default, but this guide follows the recommendations of the AdEMAMix paper. |
alpha | 5 | Multiplier for AdEMAMix's EMA contribution. • A value of 5 may be high - consider lowering the LR. • If training diverges, reduce to 2–3. • Increase to strengthen AdEMAMix's effect; decrease to weaken it. |
⚠️ Notice: AdEMAMix requires a 2–4× lower learning rate than Adam to remain stable, as its update steps are larger.
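For intuition, a minimal sketch of the dual-EMA step for a single tensor (bias correction omitted; this illustrates the idea, it is not OneTrainer's exact implementation):

```python
import torch

def ademamix_step(p, grad, m_fast, m_slow, v,
                  lr=1e-4, beta1=0.9, beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8):
    # Fast EMA: standard Adam first moment (short memory).
    m_fast.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Slow EMA: decays very slowly, retaining gradients over tens of thousands of steps.
    m_slow.mul_(beta3).add_(grad, alpha=1 - beta3)
    # Second moment, as in Adam.
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # The update blends the short-term EMA with the amplified long-term EMA.
    update = (m_fast + alpha * m_slow) / (v.sqrt() + eps)
    p.add_(update, alpha=-lr)
```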
This is an interesting and unique method introduced in the Simplified_AdEMAMix paper (arXiv:2502.02431).
Since Adam’s EMA has been shown to benefit large batch sizes, and AdEMAMix’s slow EMA benefits small batch sizes, why not align both approaches?
That’s exactly what the paper proposes: it replaces Adam’s first moment (EMA) with a gradient accumulator, capturing the strengths of both methods. The paper demonstrates - theoretically and empirically - that this single modified accumulator can match or even exceed the efficiency of both Adam and AdEMAMix when properly tuned.
📌 Key Insight from the Paper:
Classical momentum methods (e.g., Adam's EMA) do not accelerate the optimizer in noisy regimes (small batch sizes). The accumulator + `Grad α` design in Simplified_AdEMAMix directly addresses this:
- The accumulator accelerates the optimizer by accumulating gradients across steps.
- `Grad α` scales the raw gradient, allowing controlled emphasis on recent updates - ideal for cutting through noise in small-batch training.
Due to its fundamental modification of the first moment:
- It is incompatible with features relying on standard momentum (e.g., Grams, Cautious).
- It is incompatible with `atan2` or any standard gradient clipping, due to its inherently large update steps.
- It typically requires a 100x lower learning rate (also related to the `Grad α` parameter explained below).
Parameter | Default | Tuning Guide |
---|---|---|
beta1 | 0.99 | Now controls the effective memory length of the accumulator (previously the EMA decay). • 0.9 → ~10-step memory → too short for small batches. • For small batch sizes, use 0.99 to 0.9999, depending on training length and desired stability. |
Grad α (smoothing factor) | 100 | Most critical parameter to tune. Controls the weight of the raw, current gradient in the final update. • Should scale inversely with batch size (i.e., reduce it as batch size increases). • High values (10–100): emphasize recent gradients, ideal for small batches to cut through noise and adapt quickly. |
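A minimal sketch of the accumulator-based step, matching the `update = (α * grad) + accumulator` rule described later on this page. The unnormalized accumulator (no `1 - beta1` scaling) is an assumption consistent with the note that a ~100x lower learning rate is required; this is not OneTrainer's exact code.

```python
import torch

def simplified_ademamix_step(p, grad, acc, v,
                             lr=1e-6, beta1=0.99, beta2=0.999, grad_alpha=100.0, eps=1e-8):
    # The accumulator replaces Adam's first moment: gradients are summed with decay beta1
    # (assumed: no (1 - beta1) scaling), which is one reason the LR must be far smaller than Adam's.
    acc.mul_(beta1).add_(grad)
    # Second moment, as in Adam.
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # The raw gradient is weighted by Grad α and added to the accumulator.
    update = (grad_alpha * grad + acc) / (v.sqrt() + eps)
    p.add_(update, alpha=-lr)
```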
- You must set these hyperparameters when enabling `Simplified_AdEMAMix` with `Prodigy_Adv` or `Adopt_Adv`.
- Since `Simplified_AdEMAMix` requires a 100x smaller learning rate, set `initial d` in `Prodigy_Adv` to:
  - `1e-8` for LoRA
  - `1e-10` for full fine-tuning
A simple yet effective method for Adam-based optimizers that improves upon the `eps` hyperparameter, completely replacing it and making the optimizer scale-invariant. It bounds the update step within the range `[-2, 2]`, effectively acting as built-in update clipping to prevent large, destabilizing jumps during optimization.
✅ Highly Recommended for Unstable Optimizers:
This is especially beneficial for optimizers like `Adopt_Adv`, which typically require high `eps` values and update clipping to remain stable during training.
Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`, and `Simplified_AdEMAMix`.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:
- During gradient bursts → β₂ ↓ toward `Lower β₂` → faster reaction
- During calm phases → β₂ ↑ toward the selected `beta2` → stronger smoothing
This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
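A rough, layer-wise sketch for intuition only. The exact sunspike normalization used by the paper and by OneTrainer may differ, and `beta2_min`, `max_decay`, and the running-max baseline are assumptions here:

```python
import torch

def kourkoutas_beta2(grad_norm: torch.Tensor, running_max: torch.Tensor,
                     beta2_max=0.999, beta2_min=0.88, max_decay=0.99, eps=1e-8):
    # A decayed running maximum of this layer's gradient norms serves as the baseline.
    running_max.copy_(torch.maximum(running_max * max_decay, grad_norm))
    # Bounded "sunspike" ratio in [0, 1): grows when the current norm spikes toward the recent maximum.
    r = grad_norm / (running_max + eps)
    sunspike = r / (1.0 + r)
    # A large sunspike (burst) pulls beta2 toward beta2_min; a small one keeps it near beta2_max.
    return beta2_max - (beta2_max - beta2_min) * sunspike
```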
Category | Details |
---|---|
✅ Pros | • Layer-wise adaptation blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). • Robust to sudden loss landscape shifts: reacts quickly during gradient bursts, smooths during calm phases. • High tolerance to aggressive learning rates. |
⚠️ Cons | • Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using `K-β Warmup Steps`. |
💡 Best Practice: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
- Enable `Kourkoutas Beta`
- Set `beta2 = 0.999` and let it adapt.
- When `Factored` is enabled (`True`), it will factor all optimizer states, regardless of whether they belong to AdEMAMix's slow EMA, Simplified_AdEMAMix's accumulator, or Lion's momentum.
- Ranking of optimizers by tolerance to `Factored` 1-bit factorization:
  - Lion: Most tolerant - produced identical results in my tests.
  - Simplified_AdEMAMix: Next in line - because its large updates are based on raw, uncompressed gradients, it should yield identical or very comparable results.
  - Adam / Prodigy / Adopt: Deliver comparable results - and are still worth using for the memory savings (16x–32x smaller optimizer states).
- Note for `PRODIGY_Adv` hyperparameters:
  - `beta3` = beta3 for Prodigy's internal calculations; leave it empty, it won't hurt.
  - `beta3 EMA` = beta3 for AdEMAMix's EMA, which you tune depending on your run length. It only takes effect when `AdEMAMix EMA` is true.
- When the `Simplified_AdEMAMix` option is used with `Prodigy_Adv`, `d_coef` is scaled to `d_coef / alpha`. This effectively tells Prodigy, "whatever learning rate you find, make it `alpha` times smaller," which should properly align the two optimizers (tested and validated for Finetune/LoRA/Embedding); see the short worked example below. You can also use `d_limiter` instead, but some users reported bad results with it.

  ⚠️ Manual Tuning Still Required if `beta1` Changes:
  As `beta1` increases → the momentum becomes "older" → updates become more aggressive ("brute"). Compensate by reducing `d_coef`:

  beta1 | Recommended d_coef |
  ---|---|
  0.99 | 1.0 (default) |
  0.995 | 0.5 |
  0.999 | 0.1 |
  0.9999 | 0.01 |

  📌 Why? Higher `beta1` extends the memory of the accumulator, causing larger cumulative updates and requiring further LR dampening via `d_coef`.

  ⚠️ Notice: Lower `d_coef` values require more steps to find the LR, but they should be more stable and safe.
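A tiny worked example of that scaling (the specific numbers are hypothetical; only the `d_coef / alpha` relationship comes from the note above):

```python
alpha = 100       # Grad α
d_coef = 1.0      # user-set d_coef
d_found = 1e-4    # hypothetical LR that Prodigy's D-Adaptation settles on
# Because d_coef is internally scaled to d_coef / alpha, the step scale actually applied is:
effective_lr = d_found * d_coef / alpha   # = 1e-6, i.e. "alpha times smaller"
print(effective_lr)
```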
- Lion is a sign-based optimizer that produces large updates per step. This can also destabilize Prodigy's calculations, leading to an overestimated LR. In newer versions, we added the `d_limiter` option for it. It is `True` by default and should remain `True`, as it forces Prodigy to behave well with Lion, resulting in a suitable LR.
- For the optimizers `Adam_Adv`, `ADOPT_Adv`, and `Prodigy_Adv`: when `beta1` is set to `0` (or left unset/`None`), the optimizer will skip the first moment (standard EMA) entirely.
  ✅ This is fully compatible with the `use_AdEMAMix` option, allowing you to use only AdEMAMix's slow EMA while bypassing Adam's traditional momentum. Ideal for small-batch training where the standard EMA adds little to no value.
- Lion is the optimizer most closely related to `AdEMAMix`/`Simplified_AdEMAMix`, due to its typical setting of `beta1 < beta2` (e.g., 0.9 and 0.99). This gives roughly 10x more weight to the raw gradient compared to historical momentum - behavior similar to `Simplified_AdEMAMix` with `alpha=10`, but using a sign operation instead of adaptive learning rate scaling.
- Update Rule Differences Across Optimizers:
  - `AdEMAMix`: `update = fast_EMA + (α * slow_EMA)` → blends the short-term EMA with the amplified long-term EMA.
  - `Simplified_AdEMAMix`: `update = (α * grad) + accumulator` → directly scales the current gradient and adds it to a decaying accumulator.
  - `Adam`: `update = first_moment_EMA` → standard exponential moving average of gradients.
- Note for skippable hyperparameters:
  - When `AdEMAMix EMA` is `False`, the `beta3 EMA` and `alpha` parameters will be skipped and have no effect.
  - When `Simplified_AdEMAMix` is `False`, the `grad α` parameter will be skipped and have no effect.
This is the preset that worked best for me and delivered the strongest/fastest results across my tests (Finetunes/LoRAs/Embeddings):
Learning Rate: 1
optimizer: PRODIGY_Adv
settings:
- beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
- beta2: 0.999 # Baseline value; Kourkoutas-β (enabled below) adapts it per layer.
- Simplified_AdEMAMix: True
- Grad α: 100 # Controls weighting of raw gradient in update
- OrthoGrad: True
- weight_decay: 0.0
- initial_d:
• LoRA: 1e-8
• Full fine-tune: 1e-10
• Embedding: 1e-7
- d_coef: 1
- Kourkoutas Beta: True
- K-β Warmup Steps: 20 # In the range 20–200, or match the number of warmup steps you usually use for the LR.
- factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
⚠️ Note: For this preset, if `PRODIGY_Adv` is too slow in raising the LR at the start of training, you can try increasing `initial_d` from 1e-8 to 1e-7 for LoRAs, and from 1e-10 to 1e-9 for fine-tuning.
⚠️ Removed `d_limiter`, as it was a poor addition for some users.
✅ Why this works well:
- `beta1=0.99` stabilizes the momentum while allowing `Simplified_AdEMAMix` to dominate the update direction via raw gradients.
- `beta2=0.999` and `Kourkoutas Beta` stabilize the optimizer while adapting `beta2` per layer.
- `Grad α=100` emphasizes recent gradients - ideal for small-batch, noisy regimes.
- `OrthoGrad` prevents overfitting without weight decay (set `weight_decay=0.0`). Alternatively, you can use weight decay for LoRA and embeddings, as it is safe for them.
- `factored=True` (optional) enables 1-bit factorization of optimizer states, offering a 16x–32x memory reduction with minimal accuracy loss, thanks to `Simplified_AdEMAMix`'s inherent tolerance to state compression (its updates rely heavily on raw, uncompressed gradients).