Principle:Hiyouga LLaMA Factory Muon Optimization

From Leeroopedia


Knowledge Sources
Domains Optimization, Deep Learning
Last Updated 2026-02-06 19:00 GMT

Overview

Muon (MomentUm Orthogonalized by Newton-Schulz) is an optimizer that applies Newton-Schulz orthogonalization to momentum-based gradient updates, producing near-orthogonal update matrices that improve training dynamics for 2D parameter matrices.

Description

Standard optimizers like Adam or SGD produce update matrices whose singular value spectra can vary widely, leading to uneven learning across different directions in parameter space. Muon addresses this by orthogonalizing the update: after computing the standard SGD-momentum update, it replaces the update with the nearest semi-orthogonal matrix (in terms of the Frobenius norm). The resulting update has an approximately flat singular value spectrum, meaning all directions in parameter space are updated with roughly equal magnitude.

The orthogonalization is performed efficiently using a quintic Newton-Schulz iteration, which approximates the orthogonal factor of the polar decomposition after about 5 iterations. The iteration is stable enough to run in bfloat16, avoiding the need for higher-precision computation.

In LLaMA-Factory's implementation, Muon handles two types of parameters:

  • 2D weight matrices (linear layers): These receive Muon's orthogonalized updates with learning rate scaling based on matrix dimensions.
  • Non-2D parameters and embedding/head layers: These fall back to a standard AdamW optimizer with separate hyperparameters.

This hybrid approach recognizes that orthogonalization is mathematically meaningful only for 2D matrices, while 1D parameters (biases, layer norms) and embedding layers benefit from standard adaptive optimization.
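The split described above can be sketched as a simple routing rule over parameter names and shapes. This is a minimal illustration, not LLaMA-Factory's actual code: the helper name `split_params_for_muon` is hypothetical, and the substrings `"embed"` and `"lm_head"` used to detect embedding and head layers are assumptions about typical parameter naming.

```python
from typing import Dict, List, Tuple


def split_params_for_muon(
    named_shapes: Dict[str, Tuple[int, ...]],
    adamw_keywords: Tuple[str, ...] = ("embed", "lm_head"),  # assumed name patterns
) -> Tuple[List[str], List[str]]:
    """Return (muon_names, adamw_names) for a mapping of parameter name -> shape."""
    muon_names: List[str] = []
    adamw_names: List[str] = []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embed_or_head = any(key in name for key in adamw_keywords)
        if is_matrix and not is_embed_or_head:
            muon_names.append(name)   # 2D weight matrix: orthogonalized update
        else:
            adamw_names.append(name)  # everything else: AdamW fallback
    return muon_names, adamw_names


# Illustrative parameter names and shapes (not taken from a real checkpoint).
example = {
    "model.embed_tokens.weight": (32000, 4096),
    "model.layers.0.self_attn.q_proj.weight": (4096, 4096),
    "model.layers.0.input_layernorm.weight": (4096,),
    "model.layers.0.mlp.up_proj.bias": (11008,),
    "lm_head.weight": (32000, 4096),
}
muon_names, adamw_names = split_params_for_muon(example)
```

In this example only the attention projection is routed to Muon; the embedding and head matrices are 2D but still go to AdamW, together with the 1D parameters.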

Usage

Muon should be considered when:

  • Training large models from scratch where the optimizer's update quality has a significant impact on convergence.
  • Working with architectures composed primarily of linear layers (transformers).
  • Batch sizes are sufficiently large (Muon may not work well with small batch sizes per the authors' guidance).

Note: The authors caution that Muon may not be ideal for fine-tuning pretrained models, though this has not been extensively validated.

Theoretical Basis

For a gradient $G \in \mathbb{R}^{m \times n}$, the Muon update proceeds in three steps:

Step 1: Momentum accumulation

$$B_t = \mu B_{t-1} + G_t$$

where μ is the momentum coefficient (default 0.95). With Nesterov momentum:

$$\hat{G}_t = G_t + \mu B_t$$
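The momentum step can be illustrated on a flat list of gradient entries, since it is purely elementwise. This is a minimal sketch: the function name `momentum_step` is hypothetical, and the non-Nesterov branch (using the buffer itself as the update) is an assumption based on standard SGD-momentum.

```python
from typing import List, Tuple


def momentum_step(
    buf: List[float], grad: List[float], mu: float = 0.95, nesterov: bool = True
) -> Tuple[List[float], List[float]]:
    """Return (new_buffer, update) for one momentum step."""
    # B_t = mu * B_{t-1} + G_t
    new_buf = [mu * b + g for b, g in zip(buf, grad)]
    if nesterov:
        # G_hat_t = G_t + mu * B_t
        update = [g + mu * b for g, b in zip(grad, new_buf)]
    else:
        # Assumed plain-momentum variant: use the buffer directly.
        update = list(new_buf)
    return new_buf, update


buf = [0.0, 0.0]
buf, upd = momentum_step(buf, [1.0, -2.0])
```

On the first step the buffer equals the gradient, so the Nesterov update is the gradient scaled by $1 + \mu$.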

Step 2: Newton-Schulz orthogonalization

Given the momentum-accumulated gradient $\hat{G}$, compute its zeroth power, that is, the orthogonal factor of its polar decomposition, which equals $U V^T$ when $\hat{G} = U \Sigma V^T$ is the SVD. This is done via a quintic Newton-Schulz iteration whose coefficients are tuned for fast convergence:

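$$X_0 = \hat{G} / \|\hat{G}\|_F$$

$$A_k = X_k X_k^T$$

$$B_k = b A_k + c A_k^2$$

$$X_{k+1} = a X_k + B_k X_k$$

with optimized coefficients $a = 3.4445$, $b = -4.7750$, $c = 2.0315$. After 5 iterations, $X_5 \approx U V^T$, where $\hat{G} = U \Sigma V^T$ is the SVD.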

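import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int) -> torch.Tensor:
    # Quintic Newton-Schulz coefficients (a, b, c); note that b is negative.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    # Work in the wide orientation so that X @ X.T is the smaller Gram matrix.
    if G.size(0) > G.size(1):
        X = X.T
    # Normalize by the Frobenius norm so the spectral norm is at most 1.
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    # Undo the transpose so the output has the same shape as G.
    if G.size(0) > G.size(1):
        X = X.T
    return X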

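The coefficients are chosen to maximize the slope at zero of the iteration's polynomial, so that small singular values grow quickly. As a consequence the iteration does not converge exactly to $U V^T$: empirically it produces an output of the form $U S V^T$ whose diagonal entries satisfy $S_{ii} \sim \mathrm{Uniform}(0.5, 1.5)$ rather than being exactly 1.0. This approximate orthogonalization turns out to perform as well as exact $U V^T$ for model training.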
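One way to see this behavior is to note that the iteration is an odd polynomial in $X$, so it acts on each singular value $s$ independently through the scalar map $s \mapsto a s + b s^3 + c s^5$. The sketch below iterates that scalar map in plain Python; the function names and the chosen starting values are illustrative assumptions, not part of the original implementation.

```python
from typing import List


def ns_scalar_map(s: float, a: float = 3.4445, b: float = -4.7750, c: float = 2.0315) -> float:
    """Effect of one quintic Newton-Schulz step on a single singular value."""
    return a * s + b * s**3 + c * s**5


def iterate_singular_value(s: float, steps: int = 5) -> float:
    """Apply the scalar map `steps` times."""
    for _ in range(steps):
        s = ns_scalar_map(s)
    return s


# Singular values of the normalized input lie in (0, 1]; these are sample points.
initial: List[float] = [0.01, 0.05, 0.1, 0.3, 0.6, 1.0]
final: List[float] = [iterate_singular_value(s) for s in initial]

for s0, s5 in zip(initial, final):
    print(f"{s0:.2f} -> {s5:.3f}")
```

For these starting values the results stay inside the $(0.5, 1.5)$ band without settling at exactly 1. Near zero the map behaves like $s \mapsto a s$, which is why a large $a$ lifts small singular values in few steps.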

Step 3: Scaled update

The learning rate is adjusted based on matrix dimensions to account for the spectral properties of the orthogonal update:

$$\mathrm{lr}_{\text{adjusted}} = \mathrm{lr} \times 0.2 \times \sqrt{\max(m, n)}$$

import math

def adjust_lr_for_muon(self, lr, param_shape):
    # Scale lr by 0.2 * sqrt(max(m, n)) for an m x n weight matrix.
    A, B = param_shape[:2]
    adjusted_ratio = 0.2 * math.sqrt(max(A, B))
    return lr * adjusted_ratio

Weight decay is applied multiplicatively before the update, and the final parameter update is:

$$\theta_t = (1 - \mathrm{lr} \cdot \mathrm{wd})\, \theta_{t-1} - \mathrm{lr}_{\text{adjusted}} \cdot X_5$$
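As a worked example of Step 3, the sketch below applies the learning-rate adjustment and the decayed update to a small matrix stored as nested lists. It is a minimal illustration under stated assumptions: `muon_param_update` is a hypothetical helper, the standalone `adjust_lr_for_muon` drops the `self` argument, and the 2 x 4 update with orthonormal rows stands in for the Newton-Schulz output.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def adjust_lr_for_muon(lr: float, param_shape: Sequence[int]) -> float:
    """lr_adjusted = lr * 0.2 * sqrt(max(m, n))."""
    rows, cols = param_shape[:2]
    return lr * 0.2 * math.sqrt(max(rows, cols))


def muon_param_update(theta: Matrix, ortho_update: Matrix, lr: float, weight_decay: float) -> Matrix:
    """theta_t = (1 - lr * wd) * theta_{t-1} - lr_adjusted * X."""
    shape = (len(theta), len(theta[0]))
    adjusted_lr = adjust_lr_for_muon(lr, shape)
    decay = 1.0 - lr * weight_decay  # weight decay uses the unadjusted lr
    return [
        [decay * w - adjusted_lr * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(theta, ortho_update)
    ]


theta = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # 2 x 4 weights
update = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # rows are orthonormal
new_theta = muon_param_update(theta, update, lr=0.1, weight_decay=0.1)
```

Here $\max(m, n) = 4$, so the adjusted learning rate is $0.1 \times 0.2 \times 2 = 0.04$, while the decay factor $1 - 0.1 \times 0.1 = 0.99$ is computed from the unadjusted learning rate.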

For AdamW fallback parameters, standard bias-corrected Adam updates are applied:

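$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\theta_t = (1 - \mathrm{lr} \cdot \mathrm{wd})\, \theta_{t-1} - \frac{\mathrm{lr}}{\mathrm{scale}} \cdot \frac{m_t}{\epsilon + \sqrt{v_t}}$$

where $\mathrm{scale} = \frac{1 - \beta_1^t}{\sqrt{1 - \beta_2^t}}$.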
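The fallback update can be checked on a single scalar parameter. This is a minimal sketch of the formulas above, not the library's code: `adamw_fallback_step` is a hypothetical name, and the default hyperparameter values are illustrative assumptions.

```python
import math
from typing import Tuple


def adamw_fallback_step(
    theta: float,
    grad: float,
    m: float,
    v: float,
    step: int,
    lr: float = 1e-3,         # illustrative defaults, not taken from the source
    beta1: float = 0.9,
    beta2: float = 0.95,
    eps: float = 1e-8,
    weight_decay: float = 0.0,
) -> Tuple[float, float, float]:
    """One bias-corrected AdamW step on a scalar; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # scale = (1 - beta1^t) / sqrt(1 - beta2^t) folds both bias corrections together.
    scale = (1 - beta1**step) / math.sqrt(1 - beta2**step)
    theta = (1 - lr * weight_decay) * theta - (lr / scale) * m / (eps + math.sqrt(v))
    return theta, m, v


theta, m, v = adamw_fallback_step(theta=1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=0.01)
```

On the first step with zero-initialized moments, the bias correction cancels the $(1 - \beta)$ factors, so the parameter moves by approximately `lr` in the direction opposite to the gradient's sign.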
