Implementation:Hiyouga LLaMA Factory MoE Config

From Leeroopedia


Knowledge Sources
Domains: Mixture of Experts, Distributed Training
Last Updated: 2026-02-06 19:00 GMT

Overview

Configures Mixture-of-Experts models for correct operation with DeepSpeed ZeRO-3 partitioning and enables auxiliary loss-based expert load balancing during training.

Description

This module provides two primary functions and a compatibility class for MoE model training.

add_z3_leaf_module registers MoE block classes as DeepSpeed ZeRO-3 leaf modules for about 20 model architectures: DBRX, DeepSeek V2/V3, Ernie4.5-MoE, GraniteMoE, GLM4-MoE, GLM4V-MoE, GPT-OSS, Jamba, JetMoE, Llama4, Mixtral, OLMoE, PhiMoE, Qwen2-MoE, Qwen3-MoE, Qwen3-VL-MoE, Qwen3-Omni-MoE, and the multimodal variants Kimi-VL and InternVL 3.5. Marking a block as a leaf makes ZeRO-3 handle the block's parameters as one unit instead of hooking each expert separately, which matters because the set of experts that actually runs can differ from step to step and from rank to rank.

configure_moe enables router logit output and sets the auxiliary loss coefficient used for expert load balancing. The name of the coefficient attribute depends on the architecture: router_aux_loss_coef, aux_loss_alpha, or aux_loss_coef.

Qwen3OmniMoeThinkerTextSparseMoeBlock is a patched MoE block that fixes DeepSpeed ZeRO-2 and FSDP2 compatibility by routing all tokens through all experts and letting the gating weights determine each expert's contribution.
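The leaf-registration idea described above can be illustrated without DeepSpeed installed. The following is a toy sketch, not LLaMA Factory's or DeepSpeed's code: ToyModule, mark_leaf_modules, and the _is_leaf flag are hypothetical names invented for the example. The only detail taken from this page is that leaf targets may be given either as classes or as class-name strings, as the leaf_modules parameter of _set_z3_leaf_modules suggests.

```python
from typing import Iterator, Union


class ToyModule:
    """Minimal stand-in for a torch.nn.Module tree (hypothetical, illustration only)."""

    def __init__(self, *children: "ToyModule") -> None:
        self.children = list(children)
        self._is_leaf = False  # hypothetical flag a partitioner could consult

    def modules(self) -> Iterator["ToyModule"]:
        """Yield this module and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.modules()


class Expert(ToyModule):
    pass


class SparseMoeBlock(ToyModule):
    pass


class DecoderLayer(ToyModule):
    pass


def mark_leaf_modules(model: ToyModule, leaf_modules: list[Union[type, str]]) -> int:
    """Flag every submodule whose class or class name is listed; return the count."""
    classes = tuple(t for t in leaf_modules if isinstance(t, type))
    names = {t for t in leaf_modules if isinstance(t, str)}
    marked = 0
    for module in model.modules():
        if isinstance(module, classes) or type(module).__name__ in names:
            module._is_leaf = True
            marked += 1
    return marked


# Two decoder layers, each holding one MoE block with two experts.
model = ToyModule(
    DecoderLayer(SparseMoeBlock(Expert(), Expert())),
    DecoderLayer(SparseMoeBlock(Expert(), Expert())),
)
num_marked = mark_leaf_modules(model, ["SparseMoeBlock"])
```

Only the MoE blocks are flagged; the experts inside them are left alone, since the point of a leaf is that traversal stops at the block. In a real run the registration goes through DeepSpeed rather than a hand-rolled flag; DeepSpeed exposes deepspeed.utils.set_z3_leaf_modules for this, though how this module calls it is not shown on this page.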

Usage

Use add_z3_leaf_module when training MoE models under DeepSpeed ZeRO-3, so that MoE blocks are registered as leaf modules before training starts. Use configure_moe when training with an auxiliary load-balancing loss. Neither normally needs to be called by hand: add_z3_leaf_module runs during model patching and configure_moe runs during config patching.
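As a rough illustration of what the config-side step amounts to, the sketch below sets the router-logit flag and an architecture-specific coefficient attribute on a plain namespace object. It is not the library's implementation: configure_moe_sketch and AUX_COEF_ATTR are hypothetical names, the sketch takes the coefficient directly instead of a ModelArguments object, and the pairing of model types with attribute names is an assumption made for the example. Only the three attribute names and the skip-for-inference behaviour come from this page.

```python
from types import SimpleNamespace
from typing import Optional

# Hypothetical mapping from config.model_type to the attribute holding the
# auxiliary loss coefficient. The attribute names appear on this page; which
# model type uses which name is assumed here for illustration.
AUX_COEF_ATTR = {
    "mixtral": "router_aux_loss_coef",
    "deepseek": "aux_loss_alpha",
    "jetmoe": "aux_loss_coef",
}


def configure_moe_sketch(config, moe_aux_loss_coef: Optional[float], is_trainable: bool) -> None:
    """Enable router logits and set the aux loss coefficient on config, in place."""
    if not is_trainable:
        return  # this page states that configuration is skipped for inference
    attr = AUX_COEF_ATTR.get(getattr(config, "model_type", None))
    if attr is None:
        return  # not a model type this sketch knows about
    config.output_router_logits = True
    if moe_aux_loss_coef is not None:
        setattr(config, attr, moe_aux_loss_coef)


train_cfg = SimpleNamespace(model_type="mixtral")
configure_moe_sketch(train_cfg, 0.001, is_trainable=True)

infer_cfg = SimpleNamespace(model_type="mixtral")
configure_moe_sketch(infer_cfg, 0.001, is_trainable=False)
```

After these calls train_cfg carries both output_router_logits and router_aux_loss_coef, while infer_cfg is untouched. Mutating the config in place mirrors the side-effect contract described for configure_moe.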

Code Reference

Source Location

Signature

def _set_z3_leaf_modules(
    model: "PreTrainedModel",
    leaf_modules: list[Union["nn.Module", str]],
) -> None:
    ...

def add_z3_leaf_module(model: "PreTrainedModel") -> None:
    ...

def configure_moe(
    config: "PretrainedConfig",
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    ...

class Qwen3OmniMoeThinkerTextSparseMoeBlock(nn.Module):
    def __init__(self, config) -> None:
        ...
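    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Annotated as torch.Tensor; per the I/O Contract on this page, the
        # call returns the tuple (final_hidden_states, router_logits).
        ...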

Import

from llamafactory.model.model_utils.moe import add_z3_leaf_module, configure_moe

I/O Contract

Inputs

Name | Type | Required | Description
model | PreTrainedModel | Yes (for add_z3_leaf_module) | The MoE model to configure for ZeRO-3 leaf module registration
config | PretrainedConfig | Yes (for configure_moe) | Model configuration to modify with router logit and auxiliary loss settings
model_args | ModelArguments | Yes (for configure_moe) | Model arguments containing moe_aux_loss_coef for load balancing coefficient
is_trainable | bool | Yes (for configure_moe) | Whether the model is in training mode; configuration is skipped for inference
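To make the role of moe_aux_loss_coef concrete, here is a small sketch of a load-balancing term computed from router logits. This page does not give the formula, so the code assumes the widely used Switch-Transformer-style top-1 formulation, num_experts * sum_i(f_i * P_i), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. The function names are hypothetical, and the loss actually computed by a given architecture may differ.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits: list[list[float]]) -> float:
    """Assumed top-1 Switch-style loss: num_experts * sum_i f_i * P_i."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    # f_i: fraction of tokens whose highest-probability expert is i
    counts = [0] * num_experts
    for row in probs:
        counts[row.index(max(row))] += 1
    frac = [c / num_tokens for c in counts]
    # P_i: mean router probability assigned to expert i
    mean_prob = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Two tokens, two experts: one token per expert versus both on expert 0.
balanced = load_balancing_loss([[4.0, 0.0], [0.0, 4.0]])
collapsed = load_balancing_loss([[4.0, 0.0], [4.0, 0.0]])

moe_aux_loss_coef = 0.001  # illustrative value, not a recommended setting
aux_term = moe_aux_loss_coef * collapsed  # scaled term added to the training loss
```

Under this formulation the term is smallest when tokens are spread evenly across experts and grows as routing collapses onto a few of them, and the coefficient controls how strongly that pressure is weighted against the main loss. It also shows why router logits have to be exposed during training: the term cannot be computed without them.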

Outputs

Name | Type | Description
(side effect) | None | add_z3_leaf_module registers MoE blocks as ZeRO-3 leaf modules via DeepSpeed API
(side effect) | None | configure_moe sets output_router_logits and router_aux_loss_coef (or equivalent) on the config in-place
final_hidden_states, router_logits | tuple[torch.Tensor, torch.Tensor] | Output of Qwen3OmniMoeThinkerTextSparseMoeBlock.forward: processed hidden states and router logits
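The "all tokens through all experts" strategy of the patched block can be sketched in plain Python. This is a toy model under stated assumptions, not the class's actual code: vectors are Python lists, experts are simple scaling functions, the router is a hand-written function, and renormalising the top-k probabilities is an assumption. The sketch shows that evaluating every expert and multiplying by a dense weight vector (zero for unselected experts) gives the same result as evaluating only the selected experts.

```python
import math

Vector = list[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(router_logits: Vector, top_k: int) -> Vector:
    """Dense per-expert weights: renormalised top-k probabilities, zero elsewhere."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [probs[i] / norm if i in top else 0.0 for i in range(len(probs))]


def dense_moe_forward(tokens, router, experts, top_k):
    """Run every expert on every token; zero weights cancel unselected experts."""
    outputs, all_logits = [], []
    for x in tokens:
        logits = router(x)
        weights = gate_weights(logits, top_k)
        out = [0.0] * len(x)
        for w, expert in zip(weights, experts):  # no data-dependent skipping
            out = [o + w * y for o, y in zip(out, expert(x))]
        outputs.append(out)
        all_logits.append(logits)
    return outputs, all_logits


def sparse_moe_forward(tokens, router, experts, top_k):
    """Reference path: only experts with non-zero weight are evaluated."""
    outputs = []
    for x in tokens:
        weights = gate_weights(router(x), top_k)
        out = [0.0] * len(x)
        for w, expert in zip(weights, experts):
            if w > 0.0:
                out = [o + w * y for o, y in zip(out, expert(x))]
        outputs.append(out)
    return outputs


# Toy setup (hypothetical): three scaling experts and a hand-written router.
tokens = [[1.0, 0.5], [-2.0, 0.3]]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
router = lambda x: [x[0], x[1], -x[0]]

final_hidden_states, router_logits = dense_moe_forward(tokens, router, experts, top_k=2)
reference = sparse_moe_forward(tokens, router, experts, top_k=2)
```

The dense path costs more compute, since unselected experts still run, but every expert takes part in every forward pass regardless of the routing decision. A plausible reason this helps under ZeRO-2 and FSDP2, not confirmed on this page, is that all ranks then execute the same set of expert modules on each step.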

Usage Examples

from llamafactory.model.model_utils.moe import add_z3_leaf_module, configure_moe

# Register MoE leaf modules for DeepSpeed ZeRO-3
# (called automatically during model patching)
add_z3_leaf_module(model)

# Configure auxiliary loss for expert load balancing
# (called automatically during config patching)
configure_moe(config, model_args, is_trainable=True)

# Supported architectures for ZeRO-3 leaf registration:
# DBRX, DeepSeek V2/V3, Ernie4.5-MoE, GraniteMoE, GLM4-MoE,
# GLM4V-MoE, GPT-OSS, Jamba, JetMoE, Llama4, Mixtral, OLMoE,
# PhiMoE, Qwen2-MoE, Qwen3-MoE, Qwen3-VL-MoE, Qwen3-Omni-MoE,
# Kimi-VL, InternVL 3.5
