Principle:Hiyouga LLaMA Factory Muon Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep Learning |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer that applies Newton-Schulz orthogonalization to momentum-based gradient updates, producing near-orthogonal update matrices that improve training dynamics for 2D parameter matrices.
Description
Standard optimizers like Adam or SGD produce update matrices whose singular value spectra can vary widely, leading to uneven learning across different directions in parameter space. Muon addresses this by orthogonalizing the update: after computing the standard SGD-momentum update, it replaces the update with the nearest orthogonal matrix (in terms of the Frobenius norm). This ensures that the update has a flat singular value spectrum, meaning all directions in parameter space are updated with equal magnitude.
The orthogonalization is performed efficiently using a quintic Newton-Schulz iteration, which approximates the orthogonal factor of the polar decomposition in about 5 iterations. The iteration is numerically stable in bfloat16, avoiding the need for higher-precision computation.
In LLaMA-Factory's implementation, Muon handles two types of parameters:
- 2D weight matrices (linear layers): These receive Muon's orthogonalized updates with learning rate scaling based on matrix dimensions.
- Non-2D parameters and embedding/head layers: These fall back to a standard AdamW optimizer with separate hyperparameters.
This hybrid approach recognizes that orthogonalization is mathematically meaningful only for 2D matrices, while 1D parameters (biases, layer norms) and embedding layers benefit from standard adaptive optimization.
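The routing rule described above can be sketched in a few lines. This is a minimal illustration, not LLaMA-Factory's actual code: the helper `route_parameters` and the name fragments used to detect embedding and head layers (`embed`, `lm_head`) are hypothetical assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical name fragments marking embedding/head layers (an assumption,
# not taken from LLaMA-Factory).
ADAMW_NAME_HINTS = ("embed", "lm_head")


def route_parameters(
    named_shapes: List[Tuple[str, Tuple[int, ...]]],
) -> Dict[str, List[str]]:
    """Split parameter names into a Muon group and an AdamW fallback group."""
    groups: Dict[str, List[str]] = {"muon": [], "adamw": []}
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        is_embed_or_head = any(hint in name for hint in ADAMW_NAME_HINTS)
        if is_matrix and not is_embed_or_head:
            groups["muon"].append(name)
        else:
            groups["adamw"].append(name)
    return groups


# Illustrative parameter names and shapes only.
example = [
    ("model.embed_tokens.weight", (32000, 4096)),
    ("model.layers.0.self_attn.q_proj.weight", (4096, 4096)),
    ("model.layers.0.input_layernorm.weight", (4096,)),
    ("lm_head.weight", (32000, 4096)),
]
groups = route_parameters(example)
```

In this sketch only the attention projection lands in the Muon group. The embedding and head matrices are 2D but are routed to AdamW by name, and the layer-norm weight is routed there by shape.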
Usage
Muon should be considered when:
- Training large models from scratch where the optimizer's update quality has a significant impact on convergence.
- Working with architectures composed primarily of linear layers (transformers).
- Batch sizes are sufficiently large (Muon may not work well with small batch sizes per the authors' guidance).
Note: The authors caution that Muon may not be well suited to fine-tuning pretrained models, though this has not been extensively validated.
Theoretical Basis
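For a gradient $G_t$ of a 2D parameter $W$ at step $t$, the Muon update proceeds in three steps:
Step 1: Momentum accumulation
$$M_t = \mu M_{t-1} + G_t$$
where $\mu$ is the momentum coefficient (default 0.95). With Nesterov momentum:
$$\tilde{G}_t = G_t + \mu M_t$$
Without Nesterov momentum, $\tilde{G}_t = M_t$.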
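As a rough sketch of this step, the function below runs one momentum update on a flattened list of values. The helper name `momentum_step` is hypothetical, and the Nesterov form used here (gradient plus `mu` times the new buffer) is an assumption based on common Muon implementations rather than something confirmed for LLaMA-Factory.

```python
from typing import List, Tuple


def momentum_step(
    buf: List[float], grad: List[float], mu: float = 0.95, nesterov: bool = True
) -> Tuple[List[float], List[float]]:
    """Return (new_buffer, update) for one momentum step on flattened values."""
    # Accumulate: new buffer = mu * old buffer + gradient
    new_buf = [mu * m + g for m, g in zip(buf, grad)]
    if nesterov:
        # Nesterov look-ahead (assumed form): gradient + mu * new buffer
        update = [g + mu * m for g, m in zip(grad, new_buf)]
    else:
        update = list(new_buf)
    return new_buf, update


# First step from a zero buffer.
buf, update = momentum_step([0.0, 0.0], [1.0, -2.0])
```

On the first step the buffer equals the gradient, so the Nesterov update is the gradient scaled by `1 + mu`.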
Step 2: Newton-Schulz orthogonalization
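Given the momentum-accumulated gradient $\tilde{G}_t$, compute its zeroth power (the orthogonal factor $UV^\top$ from the polar decomposition $\tilde{G}_t = (UV^\top)(V \Sigma V^\top)$). This is done via a quintic Newton-Schulz iteration whose coefficients are tuned for convergence speed:
$$X_0 = \frac{\tilde{G}_t}{\lVert \tilde{G}_t \rVert_F + \epsilon}, \qquad X_{k+1} = a X_k + b\,(X_k X_k^\top) X_k + c\,(X_k X_k^\top)^2 X_k$$
with optimized coefficients $a = 3.4445$, $b = -4.7750$, $c = 2.0315$. After 5 iterations, $X_5 \approx UV^\top$, where $\tilde{G}_t = U \Sigma V^\top$ is the SVD.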
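
    import torch

    def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int) -> torch.Tensor:
        # Quintic Newton-Schulz coefficients
        a, b, c = (3.4445, -4.7750, 2.0315)
        X = G.bfloat16()
        # Work on the wide orientation so X @ X.T is the smaller Gram matrix
        if G.size(0) > G.size(1):
            X = X.T
        # Normalize by the Frobenius norm so the spectral norm is at most 1
        X = X / (X.norm() + 1e-7)
        for _ in range(steps):
            A = X @ X.T
            B = b * A + c * A @ A
            X = a * X + B @ X
        # Undo the transpose for tall inputs
        if G.size(0) > G.size(1):
            X = X.T
        return X
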
The coefficients are chosen to maximize the slope at zero of the iteration's convergence profile, which empirically produces updates of the form $U S' V^\top$ whose singular values $S'_{ii}$ fall roughly in the range $[0.5, 1.5]$ rather than being exactly 1.0. This approximate orthogonalization turns out to perform as well as exact orthogonalization for model training.
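One way to see this behavior without any matrix code: in the SVD basis, the iteration acts on each singular value independently through the scalar map $\sigma \mapsto a\sigma + b\sigma^3 + c\sigma^5$. The sketch below applies that map five times to a few starting values. The helper name `ns_scalar` and the chosen starting values are illustrative assumptions.

```python
# Quintic Newton-Schulz coefficients quoted in the text.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def ns_scalar(sigma: float, steps: int = 5) -> float:
    """Apply sigma -> a*sigma + b*sigma^3 + c*sigma^5 repeatedly."""
    for _ in range(steps):
        sigma = A_COEF * sigma + B_COEF * sigma**3 + C_COEF * sigma**5
    return sigma


# After Frobenius normalization every singular value lies in [0, 1].
results = {s: ns_scalar(s) for s in (0.01, 0.1, 0.5, 1.0)}
for start, end in results.items():
    print(f"{start:>5} -> {end:.3f}")
```

Starting values spanning two orders of magnitude all end up of order one, but none settles at exactly 1.0. Small singular values grow by roughly a factor of `a` per step, which is why a large slope at zero speeds up the iteration.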
Step 3: Scaled update
The learning rate is adjusted based on matrix dimensions to account for the spectral properties of the orthogonal update:

    import math

    def adjust_lr_for_muon(self, lr, param_shape):
        # Method of the optimizer class: scale lr by 0.2 * sqrt(max(A, B))
        A, B = param_shape[:2]
        adjusted_ratio = 0.2 * math.sqrt(max(A, B))
        return lr * adjusted_ratio

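A short worked example of the scaling rule, written as a standalone function so it can be run outside the optimizer class. The name `adjust_lr` and the matrix shapes are hypothetical, chosen only for illustration.

```python
import math
from typing import Sequence


def adjust_lr(lr: float, param_shape: Sequence[int]) -> float:
    """Standalone version of the rule: lr * 0.2 * sqrt(max(A, B))."""
    rows, cols = param_shape[:2]
    return lr * 0.2 * math.sqrt(max(rows, cols))


# Hypothetical shapes for illustration.
square_lr = adjust_lr(1e-3, (4096, 4096))  # factor 0.2 * 64 = 12.8
wide_lr = adjust_lr(1e-3, (1024, 4096))    # same factor: only max(A, B) matters
small_lr = adjust_lr(1e-3, (256, 256))     # factor 0.2 * 16 = 3.2
```

An exactly semi-orthogonal A×B matrix has root-mean-square entry size $1/\sqrt{\max(A, B)}$, so this factor makes the per-element update size roughly 0.2 times the base learning rate regardless of shape.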
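Weight decay is applied multiplicatively before the update, and the final parameter update is:
$$W_t = (1 - \eta \lambda)\, W_{t-1} - \eta_{\text{adj}}\, O_t$$
where $\eta$ is the base learning rate, $\lambda$ the weight decay coefficient, $\eta_{\text{adj}} = 0.2 \sqrt{\max(A, B)}\,\eta$ the dimension-adjusted learning rate, and $O_t$ the orthogonalized update from Step 2.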
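The following sketch applies that final step to a flattened list of weights. The helper `muon_apply` is hypothetical, and using the base learning rate (not the adjusted one) in the decay factor is an assumption of this sketch.

```python
from typing import List


def muon_apply(
    weights: List[float],
    ortho_update: List[float],
    lr: float,
    adjusted_lr: float,
    weight_decay: float,
) -> List[float]:
    """Decay the weights multiplicatively, then subtract the scaled update."""
    # Assumption: the decay factor uses the base learning rate.
    decayed = [w * (1.0 - lr * weight_decay) for w in weights]
    return [w - adjusted_lr * o for w, o in zip(decayed, ortho_update)]


# Illustrative numbers only.
new_w = muon_apply(
    [1.0, -1.0], [0.5, 0.5], lr=0.01, adjusted_lr=0.128, weight_decay=0.1
)
```

Because the decay is multiplicative, it shrinks every weight toward zero by the same relative amount before the orthogonalized update is subtracted.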
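For AdamW fallback parameters, standard bias-corrected Adam updates are applied:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\theta_t = (1 - \eta \lambda)\, \theta_{t-1} - \frac{\eta}{s_t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$
where $s_t = \dfrac{1 - \beta_1^t}{\sqrt{1 - \beta_2^t}}$ is the combined bias-correction factor.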
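A scalar sketch of one such fallback step is shown below. The function `adamw_step`, its default hyperparameters, and the placement of `eps` outside the bias correction are assumptions made for illustration, not values confirmed for LLaMA-Factory.

```python
import math
from typing import Tuple


def adamw_step(
    theta: float, grad: float, m: float, v: float, step: int,
    lr: float = 1e-2, beta1: float = 0.9, beta2: float = 0.999,
    eps: float = 1e-8, weight_decay: float = 0.0,
) -> Tuple[float, float, float]:
    """One bias-corrected AdamW step on a scalar; returns (theta, m, v)."""
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Combined bias-correction factor for both moments.
    scale = (1.0 - beta1**step) / math.sqrt(1.0 - beta2**step)
    theta = theta * (1.0 - lr * weight_decay)  # decoupled weight decay
    theta = theta - (lr / scale) * m / (math.sqrt(v) + eps)
    return theta, m, v


# First step from zero-initialized moments.
theta, m, v = adamw_step(theta=1.0, grad=2.0, m=0.0, v=0.0, step=1)
```

On the first step the bias correction cancels the small initial moments, so the parameter moves by approximately `lr` in the direction opposite the gradient, independent of the gradient's scale.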