Principle:Hiyouga LLaMA Factory Muon Optimization

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Optimization, Deep Learning
Last Updated	2026-02-06 19:00 GMT

Overview

Muon (MomentUm Orthogonalized by Newton-schulz) is an optimizer that applies Newton-Schulz orthogonalization to momentum-based gradient updates, producing near-orthogonal update matrices that improve training dynamics for 2D parameter matrices.

Description

Standard optimizers such as Adam or SGD produce update matrices whose singular values can span a wide range, so some directions in parameter space receive much larger updates than others. Muon addresses this by orthogonalizing the update: after computing the SGD-momentum update, it replaces that update with the nearest semi-orthogonal matrix in the Frobenius norm. The resulting update has an approximately flat singular value spectrum, so every direction it touches is updated with roughly equal magnitude.

The orthogonalization is performed efficiently using a quintic Newton-Schulz iteration, which approximates the orthogonal factor of the polar decomposition in about 5 iterations. The iteration is stable enough to run in bfloat16, so no higher-precision computation is needed.

In LLaMA-Factory's implementation, Muon handles two types of parameters:

2D weight matrices (linear layers): These receive Muon's orthogonalized updates with learning rate scaling based on matrix dimensions.
Non-2D parameters and embedding/head layers: These fall back to a standard AdamW optimizer with separate hyperparameters.

This hybrid approach reflects the fact that orthogonalization is mathematically meaningful only for 2D matrices. 1D parameters (biases, layer norm weights) and the embedding and head layers are instead handled by standard adaptive optimization.
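The split between the two parameter groups can be sketched as a simple partition over named parameters. This is a minimal illustration, not LLaMA-Factory's code: the `partition_params` helper, the `(name, shape)` input format, and the `"embed"` / `"lm_head"` name filters are all assumptions made for the example.

```python
def partition_params(named_shapes, fallback_keywords=("embed", "lm_head")):
    """Split (name, shape) pairs into Muon and AdamW groups.

    Hypothetical helper: the keyword filter is an assumed naming convention,
    not something taken from the LLaMA-Factory source.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        is_fallback_layer = any(key in name for key in fallback_keywords)
        if is_matrix and not is_fallback_layer:
            muon.append(name)   # 2D weight matrix: orthogonalized update
        else:
            adamw.append(name)  # 1D parameter or embedding/head: AdamW
    return muon, adamw


# Illustrative parameter names and shapes (assumed, transformer-like).
example = [
    ("model.embed_tokens.weight", (32000, 4096)),
    ("model.layers.0.self_attn.q_proj.weight", (4096, 4096)),
    ("model.layers.0.input_layernorm.weight", (4096,)),
    ("lm_head.weight", (32000, 4096)),
]
muon_names, adamw_names = partition_params(example)
```

Only the attention projection lands in the Muon group here; the embedding and head are 2D but are routed to AdamW by name.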

Usage

Muon should be considered when:

Training large models from scratch where the optimizer's update quality has a significant impact on convergence.
Working with architectures composed primarily of linear layers (transformers).
Training with sufficiently large batch sizes (per the authors' guidance, Muon may not work well with small batches).
Note: The authors caution that Muon may not be ideal for fine-tuning pretrained models, though this has not been extensively validated.

Theoretical Basis

For a gradient $G \in ℝ^{m \times n}$, the Muon update proceeds in three steps:

Step 1: Momentum accumulation

$B_{t} = μ \cdot B_{t - 1} + G_{t}$

where $μ$ is the momentum coefficient (default 0.95). With Nesterov momentum, the gradient passed to the orthogonalization step is:

${\hat{G}}_{t} = G_{t} + μ \cdot B_{t}$
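The two formulas above can be traced on a tiny example. The sketch below uses flat Python lists instead of tensors to stay dependency-free; the `momentum_step` helper is hypothetical, and the non-Nesterov branch (using the buffer itself as the update) is an assumption, since the text only spells out the Nesterov case.

```python
def momentum_step(buf, grad, mu=0.95, nesterov=True):
    """Return (new_buf, update) for flat lists of floats."""
    # B_t = mu * B_{t-1} + G_t
    new_buf = [mu * b + g for b, g in zip(buf, grad)]
    if nesterov:
        # G_hat_t = G_t + mu * B_t
        update = [g + mu * b for g, b in zip(grad, new_buf)]
    else:
        # Assumed plain-momentum variant: use the buffer directly.
        update = list(new_buf)
    return new_buf, update


grad = [1.0, -2.0]
buf1, update1 = momentum_step([0.0, 0.0], grad)  # first step, empty buffer
buf2, update2 = momentum_step(buf1, grad)        # second step, same gradient
```

With a constant gradient the buffer keeps growing toward $G / (1 - μ)$, which is why the update is normalized before orthogonalization in Step 2.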

Step 2: Newton-Schulz orthogonalization

Given the momentum-accumulated gradient $\hat{G}$, compute its zeroth power, i.e. the orthogonal factor $Q = U V^{T}$ of the polar decomposition $\hat{G} = Q P$, where $\hat{G} = U Σ V^{T}$ is the SVD and $P = V Σ V^{T}$. This is done via a quintic Newton-Schulz iteration whose coefficients are tuned for fast convergence:

$X_{0} = \hat{G} / ‖ \hat{G} ‖_{F}$

$A_{k} = X_{k} X_{k}^{T}$

$B_{k} = b \cdot A_{k} + c \cdot A_{k}^{2}$

$X_{k + 1} = a \cdot X_{k} + B_{k} \cdot X_{k}$

with optimized coefficients $a = 3.4445$, $b = -4.7750$, $c = 2.0315$. After 5 iterations, $X_{5} \approx U V^{T}$, with $U$ and $V$ taken from the SVD $\hat{G} = U Σ V^{T}$.

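import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int) -> torch.Tensor:
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    # Work in the wide orientation so that X @ X.T is the smaller Gram matrix.
    if G.size(0) > G.size(1):
        X = X.T
    # Frobenius normalization bounds the spectral norm by 1.
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    # Undo the transpose so the result has the same shape as G.
    if G.size(0) > G.size(1):
        X = X.T
    return X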

The coefficients are chosen to maximize the slope at zero of the iteration's polynomial, so that small singular values grow quickly. As a consequence the iteration does not converge to $U V^{T}$ exactly: empirically it produces something like $U S' V^{T}$, where $S'$ is diagonal with $S'_{ii} \sim \mathrm{Uniform}(0.5, 1.5)$ rather than exactly 1.0. This approximate orthogonalization turns out to perform as well as exact $U V^{T}$ for model training.
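Because $X_{k} X_{k}^{T}$ shares its singular vectors with $X_{k}$, each iteration leaves the singular vectors untouched and maps every singular value $σ$ to $aσ + bσ^{3} + cσ^{5}$. The following pure-Python sketch, an illustration rather than part of the implementation, tracks a few singular values through 5 iterations. The starting values are arbitrary choices in $(0, 1]$, and `ns_scalar` is a hypothetical helper name.

```python
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def ns_scalar(sigma, steps=5):
    """Apply the quintic Newton-Schulz polynomial to one singular value."""
    for _ in range(steps):
        sigma = A_COEF * sigma + B_COEF * sigma**3 + C_COEF * sigma**5
    return sigma


# Singular values after Frobenius normalization lie in (0, 1].
starts = [0.01, 0.1, 0.5, 1.0]
finals = [ns_scalar(s) for s in starts]
for s, f in zip(starts, finals):
    print(f"sigma_0 = {s:<5} -> sigma_5 = {f:.3f}")
```

Values that start two orders of magnitude apart end up at a comparable size, in a band around 1 rather than at exactly 1, which matches the approximate orthogonalization described above.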

Step 3: Scaled update

The learning rate is adjusted based on matrix dimensions to account for the spectral properties of the orthogonal update:

${lr}_{adjusted} = lr \times 0.2 \times \sqrt{\max (m, n)}$

import math

def adjust_lr_for_muon(self, lr, param_shape):
    A, B = param_shape[:2]  # matrix dimensions (m, n)
    adjusted_ratio = 0.2 * math.sqrt(max(A, B))
    return lr * adjusted_ratio
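To get a feel for the size of this factor, the sketch below evaluates it for a few matrix shapes. The standalone `muon_lr_ratio` function and the shapes are illustrative assumptions, not values taken from LLaMA-Factory.

```python
import math


def muon_lr_ratio(shape):
    """Multiplier applied to the base learning rate for an (m, n) matrix."""
    m, n = shape[:2]
    return 0.2 * math.sqrt(max(m, n))


# Illustrative shapes: two square projections and one wide MLP matrix.
shapes = [(1024, 1024), (4096, 4096), (4096, 11008)]
ratios = {shape: muon_lr_ratio(shape) for shape in shapes}
for shape, ratio in ratios.items():
    print(shape, round(ratio, 2))
```

The multiplier grows with the square root of the larger dimension, so the same base learning rate yields a larger effective step for larger matrices.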

Weight decay is applied multiplicatively before the update, and the final parameter update is:

$θ_{t} = (1 - lr \cdot wd) \cdot θ_{t - 1} - {lr}_{adjusted} \cdot X_{5}$
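The update rule above can be written out for a single matrix as follows. This is a minimal sketch using nested lists in place of tensors; `muon_apply` is a hypothetical name, and the orthogonalized update is passed in directly rather than computed by Newton-Schulz.

```python
import math


def muon_apply(theta, ortho_update, lr, weight_decay):
    """Return the new m x n parameter matrix (nested lists of floats)."""
    m, n = len(theta), len(theta[0])
    lr_adjusted = lr * 0.2 * math.sqrt(max(m, n))
    decay = 1.0 - lr * weight_decay  # multiplicative weight decay uses the base lr
    return [
        [decay * t - lr_adjusted * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(theta, ortho_update)
    ]


theta = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # 4 x 2 parameter
update = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # orthonormal columns
new_theta = muon_apply(theta, update, lr=0.1, weight_decay=0.5)
```

Note that the decay term uses the base learning rate while the update term uses the dimension-adjusted one, exactly as in the formula.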

For AdamW fallback parameters, standard bias-corrected Adam updates are applied:

$m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}$

$v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}$

$θ_{t} = (1 - lr \cdot wd) \cdot θ_{t - 1} - \frac{lr}{scale} \cdot \frac{m_{t}}{ϵ + \sqrt{v_{t}}}$

where $scale = \frac{1 - β_{1}^{t}}{\sqrt{1 - β_{2}^{t}}}$ combines the two bias-correction terms.
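A scalar version of the fallback update makes the role of $scale$ easier to see. This is a sketch under stated assumptions: `adamw_fallback_step` is a hypothetical helper, and the default `betas` and `eps` are common AdamW values chosen for the example, not values confirmed by the source.

```python
import math


def adamw_fallback_step(theta, grad, m, v, step, lr,
                        betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One bias-corrected AdamW step on scalars; returns (theta, m, v)."""
    beta1, beta2 = betas  # assumed defaults
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # scale = (1 - beta1^t) / sqrt(1 - beta2^t)
    scale = (1 - beta1**step) / math.sqrt(1 - beta2**step)
    theta = (1 - lr * weight_decay) * theta - (lr / scale) * m / (eps + math.sqrt(v))
    return theta, m, v


theta1, m1, v1 = adamw_fallback_step(1.0, grad=1.0, m=0.0, v=0.0, step=1, lr=0.1)
```

On the first step the bias correction cancels the $(1 - β_{1})$ and $\sqrt{1 - β_{2}}$ factors, so the parameter moves by almost exactly $lr$ in the direction opposite to the gradient.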

Related Pages

Implementation:Hiyouga_LLaMA_Factory_Muon_Optimizer
