Orthogonal Optimizers - Nerogar/OneTrainer GitHub Wiki

Orthogonal optimizers are a newer family of optimizers that train neural networks differently from traditional optimizers such as AdamW or SGD.

The Simple Explanation

Traditional optimizers (e.g., Adam, Prodigy, SGD) look at every single number (parameter) individually. They ask, "Should this specific number go up or down?" Orthogonal Optimizers look at the parameters of a layer as a whole group (a matrix). They force the update to be "balanced" across the whole layer. This prevents the model from fixating on one easy pattern while ignoring other important details, often leading to better quality training in less time.

â„šī¸ Note: $\mathbf{Muon}$ cannot handle 1D parameters (biases, normalization scales, embeddings, etc.). Therefore, $\mathbf{Muon}$ must be paired with an auxiliary optimizer (AdamW) to handle these tensors.

OneTrainer handles all of this complexity for you. You do not need to separate your layers manually: when using MuonWithAuxAdam, the optimizer automatically detects which layers should use $\mathbf{Muon}$ and which should fall back to the standard auxiliary optimizer (AdamW).
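
In practice the split is based purely on tensor shape, plus a few special cases such as embeddings. A minimal sketch of that kind of routing, with illustrative names that are not OneTrainer's actual API:

```python
import torch

def split_params_for_muon(named_params):
    """Illustrative sketch: route 2D+ weight matrices to Muon and 1D tensors
    (biases, norm scales) plus embeddings to the auxiliary AdamW."""
    muon_params, adamw_params = [], []
    for name, p in named_params:
        if not p.requires_grad:
            continue
        # Embeddings are 2D but act as per-row lookup tables, so they are
        # usually sent to AdamW as well; here we key on the parameter name.
        if p.ndim >= 2 and "embed" not in name.lower():
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params
```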


đŸ› ī¸ Included Optimizers

| Optimizer | Description |
|---|---|
| MUON | The original $\mathbf{Muon}$ with a basic auxiliary AdamW. |
| MUON_ADV | Advanced $\mathbf{Muon}$ implementation with an auxiliary AdamW_Adv. |
| ADAMUON_ADV | Sign-based orthogonalization with adaptive second-moment scaling. |

$\mathbf{Muon}$ Optimizer

$\mathbf{Muon}$ (MomentUm Orthogonalized by Newton-Schulz) was introduced by Keller Jordan. It is designed specifically for the internal 2D (matrix) transformations of a neural network (e.g., Linear layers, reshaped Conv2D).

In OneTrainer, the MUON option combines this basic $\mathbf{Muon}$ optimizer with a basic auxiliary AdamW.
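
For intuition, here is a minimal sketch of the Newton-Schulz orthogonalization at the core of $\mathbf{Muon}$, loosely following Keller Jordan's public reference implementation (coefficients and dtype handling in OneTrainer's code may differ):

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately replace the 2D matrix G with the nearest semi-orthogonal
    matrix using a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon reference
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + eps)            # spectral norm <= 1 so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```

In $\mathbf{Muon}$, it is the momentum buffer (not the raw gradient) that gets orthogonalized this way; the result is then scaled and applied as the weight update.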

Advanced Variants

These are advanced variants of $\mathbf{Muon}$ with additional improvements and settings.

| Optimizer | Description |
|---|---|
| Muon_adv | Advanced $\mathbf{Muon}$ implementation with CANS, NorMuon, low-rank orthogonalization, and other features. |
| AdaMuon_adv | Advanced AdaMuon implementation; combines $\mathbf{Muon}$'s geometry with Adam's adaptive scaling and sign-based orthogonalization. Downside: it introduces a new optimizer state (doubling the memory footprint compared to standard $\mathbf{Muon}$). |
  • Note: These variants use AdamW_Adv as the auxiliary optimizer, inheriting all its features.

âš™ī¸ Configuration Options

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Newton-Schulz Iterations | The Newton-Schulz step is the heart of $\mathbf{Muon}$; it balances the update matrix. This controls how hard the optimizer tries to make the update "perfect". | Default: 5. Leave at 5. | Lowering it (3-4) saves a tiny bit of speed but reduces quality. Raising it (6+) is usually wasted effort. | Newton-Schulz | All |
| Accelerated Newton-Schulz | Enables Chebyshev-Accelerated Newton-Schulz (CANS). Instead of using static coefficients, it computes the optimal polynomial dynamically at every step. | Default: Disabled. If used, set Newton-Schulz Iterations to 7 (7 with CANS $\approx$ 5 without). | Makes the orthogonalization much faster and more accurate; reaches better orthogonality in fewer steps. | Chebyshev Acceleration | Advanced Variants |
| Low-Rank Orthogonalization | Instead of orthogonalizing the full matrix, sketches it down to a smaller size (Ortho Rank), orthogonalizes that, and projects back. | Default: Disabled. Use for full finetuning; disable for LoRAs/OFTs. | Significantly faster step times on very large matrices (e.g., full finetuning) and increased robustness to noise. | Matrix Sketching | Advanced Variants |
| Nesterov Momentum | Applies Nesterov's "lookahead" property: mixes the momentum with the raw gradient for improved training. | Default: True. Keep enabled. | Improves convergence speed and stability. | Nesterov Acceleration | All. Original $\mathbf{Muon}$ uses it by default; disabled if Simplified AdEMAMix is active. |
| Simplified AdEMAMix | Incorporates long-term momentum with an emphasis on recent gradients. Keeps the same LR as without it, thanks to $\mathbf{Muon}$'s normalization. | Default: Disabled. Use it with AdaMuon_adv (which uses the sign of the update and ignores its magnitude). | Performs better at small batch sizes with the right tuning. Mitigates optimizer "forgetting" via long-term momentum. | Simplified AdEMAMix | Advanced Variants |
| RMS Rescaling | Scales the internal updates to match AdamW's typical update magnitude. | Default: True. Keep ON. | Lets you use the standard learning rates you are accustomed to from AdamW. | RMS Normalization | Advanced Variants |
| NorMuon Variant | Adds a per-neuron normalization step to keep orthogonal updates consistent and prevent magnitude drift. | Default: True. Recommended for all cases of Muon_adv and AdaMuon_adv. | Gives $\mathbf{Muon}$ the same update variance as Adam. Massive VRAM savings for AdaMuon (converts its state to scalars). | NorMuon | Advanced Variants |
| Approx MARS-M | Adds a "correction" term to the momentum, computed from the difference between the previous and current step's gradients, making it reactive to changes in the loss landscape. | Default: False. Recommended if you have spare VRAM for the additional state. | More robust and stable training at small batch sizes. Requires additional state to store the previous step's gradient. | MARS-M | Advanced Variants |
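
As a rough illustration of RMS Rescaling, the idea is to rescale the orthogonalized update so that its root-mean-square magnitude matches a typical AdamW update. The target constant below (0.2) is an assumption for the sketch, not necessarily OneTrainer's value:

```python
import torch

def rms_rescale(update: torch.Tensor, target_rms: float = 0.2) -> torch.Tensor:
    """Rescale an orthogonalized update so its RMS hits a fixed target,
    letting AdamW-style learning rates transfer over unchanged."""
    rms = update.pow(2).mean().sqrt()
    return update * (target_rms / (rms + 1e-12))
```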

📊 $\mathbf{Muon}$ and Optimal Batch Size

According to a recent paper, $\mathbf{Muon}$ demonstrates a relationship where decreasing the batch size requires increasing $\beta_1$ (the lifespan of momentum).

(Figure: optimal $\beta_1$ as a function of batch size, from the referenced paper.)

Essentially, you need a higher $\beta_1$ when your batch size is smaller and your dataset is larger.

Rough Example: If you have a batch size of 1 and 800 images (with $\text{repeat}=1$), it is beneficial to have a $\beta_1$ of around 0.999. This ensures the momentum effectively covers your dataset (roughly a 1000-step memory). You could even increase it further to cover 2-4 epochs, though finding the optimal $\beta_1$ requires experimentation. These values are approximate.
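
The "1000-step memory" comes from the usual rule of thumb for the effective averaging window of an exponential moving average:

$$N_{\text{eff}} \approx \frac{1}{1-\beta_1}, \qquad \beta_1 = 0.999 \;\Rightarrow\; N_{\text{eff}} \approx 1000 \text{ steps},$$

which is slightly more than one epoch of 800 images at batch size 1.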

📉 The Downside of $\beta_1$ and Approx MARS-M

The main difficulty in tuning $\beta_1$ is the inherent trade-off:

  • Low $\beta_1$ (e.g., 0.95) $\rightarrow$ High Responsiveness (quick adaptation to recent gradients).
  • High $\beta_1$ (e.g., 0.999) $\rightarrow$ High Smoothing (effective reduction of gradient noise/longer memory).

Ideally, small batch sizes (e.g., $\mathbf{<512}$) need both responsiveness (to adapt to limited information) and smoothing (to mitigate the higher relative noise).

NOTE: This trade-off is the core idea addressed by Simplified AdEMAMix, which mixes long-term momentum (for noise smoothing) with raw current gradients (for responsiveness).
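
A minimal sketch of the Simplified AdEMAMix mixing rule (the parameter names and exact scaling here are illustrative; with $\mathbf{Muon}$ the result is orthogonalized afterwards, so mostly the direction matters):

```python
import torch

def simplified_ademamix_direction(m: torch.Tensor, grad: torch.Tensor,
                                  beta1: float = 0.999, alpha_grad: float = 100.0) -> torch.Tensor:
    """Sketch: a long-memory momentum accumulator (smoothing) mixed with the
    raw current gradient scaled by `alpha_grad` (responsiveness)."""
    m.mul_(beta1).add_(grad)      # long-term momentum accumulator
    return m + alpha_grad * grad  # re-inject the raw gradient for responsiveness
```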

Approx MARS-M is designed to balance and combine these two desirable characteristics. It modifies the momentum update by effectively incorporating the difference between the current gradient step and the previous gradient step. This mechanism makes the momentum reactive to sudden changes in the landscape while simultaneously retaining its long-term memory.

Trade-off: Approx MARS-M requires additional state to store the previous step's gradient. However, this is negligible for LoRA training or if spare VRAM is available during full fine-tuning.
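
A minimal sketch of this kind of MARS-style correction (the `gamma` scaling and the exact placement of the correction are assumptions based on the MARS formulation, not necessarily OneTrainer's exact formula):

```python
import torch

def approx_mars_momentum_step(m: torch.Tensor, grad: torch.Tensor, prev_grad: torch.Tensor,
                              beta1: float = 0.95, gamma: float = 0.025) -> torch.Tensor:
    """Sketch: augment the gradient with a term proportional to its change
    since the previous step, then feed that into the usual momentum EMA."""
    c = grad + gamma * (beta1 / (1 - beta1)) * (grad - prev_grad)  # corrected gradient estimate
    m.mul_(beta1).add_(c, alpha=1 - beta1)                         # standard momentum EMA
    prev_grad.copy_(grad)        # the extra state: remember this step's gradient
    return m
```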


$\mathbf{Muon}$ and Weight Decay (Regularization)

$\mathbf{Muon}$ trains fast; however, without weight decay or regularization, it might overfit, fail to reach the optimal result, or achieve poor generalization. You should consider one of the following:

  • Use Weight Decay: Typical values range from $0.01$ to $0.5$.
  • Use OrthoGrad: OrthoGrad introduces slight overhead but achieves the benefits of weight decay without the need for tuning (you simply turn it on); a minimal sketch of the idea follows below.
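
The sketch below assumes a plain PyTorch tensor interface; details such as the exact norm-preserving rescale may differ from OneTrainer's implementation:

```python
import torch

def orthograd(w: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    """Sketch: remove the gradient component that points along the weights
    (which mainly grows or shrinks the weight norm), then rescale back to the
    original gradient norm. Acts like weight decay with nothing to tune."""
    w_flat, g_flat = w.reshape(-1), grad.reshape(-1)
    proj = torch.dot(w_flat, g_flat) / (torch.dot(w_flat, w_flat) + eps)
    g_orth = g_flat - proj * w_flat                              # orthogonal to the weights
    g_orth = g_orth * (g_flat.norm() / (g_orth.norm() + eps))    # keep the original scale
    return g_orth.reshape(grad.shape)
```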

✅ Quick Usage Configs

The "I just want it to work" Config:

  • Optimizer: Muon_adv (with default settings)
  • Learning Rate: (Same as you use for AdamW)
  • âš ī¸ NOTE: You can use the original $\mathbf{Muon}$, but it lacks the RMS rescaling option. This leads to a different LR scale than AdamW (roughly requiring 10× higher LR and 10× smaller weight decay).

The "I have low VRAM" Config:

  • Optimizer: AdaMuon_adv (More accurate than normal $\mathbf{Muon}$ due to sign-based updates).
  • Fused Backpass: True (Saves gradient memory footprint).
  • Factored Optimizer: True (Compresses state).
  • AuxAdam Factored Optimizer: True (Compresses the AuxAdam state too).
  • NorMuon Variant: True (Turns it into a 1-state optimizer instead of a 2-state one).
  • Others: Same as your usual/recommended settings.

The "My Batch Size is Small (1-16)" Config:

  • Optimizer: AdaMuon_adv (AdaMuon pairs very well with Simplified AdEMAMix).
  • Simplified AdEMAMix: True
  • a Grad: 100.0 (High responsiveness).
  • beta1: 0.99 to 0.9999 (Long-term momentum, depends on training length).
  • Others: Same as your usual/recommended settings.