Implementation:Hiyouga LLaMA Factory MoE Config
| Knowledge Sources | |
|---|---|
| Domains | Mixture of Experts, Distributed Training |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Configures Mixture-of-Experts models for correct operation with DeepSpeed ZeRO-3 partitioning and enables auxiliary loss-based expert load balancing during training.
Description
This module provides two primary functions and a compatibility class for MoE model training.
add_z3_leaf_module registers MoE block classes as DeepSpeed ZeRO-3 leaf modules for over 20 model architectures, including DBRX, DeepSeek V2/V3, Ernie4.5-MoE, GraniteMoE, GLM4-MoE, GLM4V-MoE, GPT-OSS, Jamba, JetMoE, Llama4, Mixtral, OLMoE, PhiMoE, Qwen2-MoE, Qwen3-MoE, Qwen3-VL-MoE, Qwen3-Omni-MoE, and multimodal variants such as Kimi-VL and InternVL 3.5. Marking a block as a leaf makes ZeRO-3 treat it as a single unit: it stops installing hooks on the block's submodules and gathers all of the block's parameters when the block is entered. This matters for MoE models because the set of experts that actually runs varies between forward passes, which otherwise disrupts ZeRO-3's parameter fetching.
configure_moe enables router logit output and sets the auxiliary loss coefficient used for expert load balancing. The attribute that holds the coefficient depends on the architecture: router_aux_loss_coef, aux_loss_alpha, or aux_loss_coef.
The module also includes Qwen3OmniMoeThinkerTextSparseMoeBlock, a patched MoE block implementation that fixes DeepSpeed ZeRO-2 and FSDP2 compatibility by routing all tokens through all experts and combining the results with weight-based gating.
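The "all tokens through all experts" idea behind the patched Qwen3OmniMoeThinkerTextSparseMoeBlock can be illustrated with a small dependency-free sketch. This is not the module's implementation: it uses plain Python lists instead of tensors, and the function names (`softmax`, `gate_weights`, `dense_moe`), the top-k selection, and the renormalization of the selected probabilities are all assumptions made for illustration. The point it shows is that every expert runs on every token, and experts that were not selected simply receive a gate weight of zero.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(router_logits, top_k):
    """Per-expert weights for one token: renormalized top-k probabilities, zero elsewhere."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = set(ranked[:top_k])
    norm = sum(probs[i] for i in selected)
    return [probs[i] / norm if i in selected else 0.0 for i in range(len(probs))]


def dense_moe(tokens, all_router_logits, experts, top_k):
    """Run every expert on every token and mix the outputs by gate weight."""
    outputs = []
    for token, logits in zip(tokens, all_router_logits):
        weights = gate_weights(logits, top_k)
        mixed = [0.0] * len(token)
        for expert, weight in zip(experts, weights):
            expert_out = expert(token)  # always executed, even when weight is 0.0
            mixed = [acc + weight * out for acc, out in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs


# Three toy "experts" (hypothetical) that scale their input by 1, 2, and 3.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
tokens = [[1.0, 2.0]]
all_router_logits = [[2.0, 1.0, -5.0]]

dense_out = dense_moe(tokens, all_router_logits, experts, top_k=2)
```

The trade-off in this style of gating is extra compute in exchange for a forward pass whose structure does not depend on the routing decision, since no expert is skipped based on the data.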
Usage
Use add_z3_leaf_module when training MoE models with DeepSpeed ZeRO-3, so that each MoE block is handled as a single leaf module instead of being hooked expert by expert. Use configure_moe when training with an auxiliary load-balancing loss. Both are called automatically during model loading and configuration patching.
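As a rough illustration of leaf-module registration, the sketch below collects the classes of matching submodules and hands them to DeepSpeed's `set_z3_leaf_modules`. The helper name `register_leaf_modules`, the lookup by class name, and the injectable `set_leaf_modules` parameter are assumptions for this example, not the module's actual code; the toy classes stand in for a real model so the sketch runs without DeepSpeed or PyTorch installed.

```python
def register_leaf_modules(model, block_class_names, set_leaf_modules=None):
    """Mark every matching MoE block class found in `model` as a ZeRO-3 leaf.

    `set_leaf_modules` is injectable so the sketch can run without DeepSpeed;
    when omitted it resolves to deepspeed.utils.set_z3_leaf_modules.
    """
    if set_leaf_modules is None:
        from deepspeed.utils import set_z3_leaf_modules as set_leaf_modules

    wanted = set(block_class_names)
    # Collect the distinct classes of matching submodules, in first-seen order.
    leaf_classes = []
    for module in model.modules():
        cls = type(module)
        if cls.__name__ in wanted and cls not in leaf_classes:
            leaf_classes.append(cls)

    if leaf_classes:
        set_leaf_modules(model, leaf_classes)
    return leaf_classes


# Toy stand-ins (hypothetical) for a model that exposes .modules().
class ToyAttention:
    pass


class ToySparseMoeBlock:
    pass


class ToyModel:
    def __init__(self):
        self._mods = [ToyAttention(), ToySparseMoeBlock(), ToySparseMoeBlock()]

    def modules(self):
        return iter(self._mods)


calls = []  # records what would have been passed to DeepSpeed
found = register_leaf_modules(
    ToyModel(),
    ["ToySparseMoeBlock"],
    set_leaf_modules=lambda model, classes: calls.append(classes),
)
print([cls.__name__ for cls in found])  # ['ToySparseMoeBlock']
```

With a real model and DeepSpeed available, omitting `set_leaf_modules` would route the call to DeepSpeed itself; in LLaMA Factory this step happens automatically, so a helper like this is only useful for understanding or debugging.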
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/model/model_utils/moe.py
- Lines: 1-252
Signature
def _set_z3_leaf_modules(
    model: "PreTrainedModel",
    leaf_modules: list[Union["nn.Module", str]],
) -> None:
    ...

def add_z3_leaf_module(model: "PreTrainedModel") -> None:
    ...

def configure_moe(
    config: "PretrainedConfig",
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    ...

class Qwen3OmniMoeThinkerTextSparseMoeBlock(nn.Module):
    def __init__(self, config) -> None:
        ...

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        ...
Import
from llamafactory.model.model_utils.moe import add_z3_leaf_module, configure_moe
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes (for add_z3_leaf_module) | The MoE model to configure for ZeRO-3 leaf module registration |
| config | PretrainedConfig | Yes (for configure_moe) | Model configuration to modify with router logit and auxiliary loss settings |
| model_args | ModelArguments | Yes (for configure_moe) | Model arguments containing moe_aux_loss_coef for load balancing coefficient |
| is_trainable | bool | Yes (for configure_moe) | Whether the model is in training mode; configuration is skipped for inference |
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | add_z3_leaf_module registers MoE blocks as ZeRO-3 leaf modules via DeepSpeed API |
| (side effect) | None | configure_moe sets output_router_logits and the auxiliary loss coefficient (router_aux_loss_coef, aux_loss_alpha, or aux_loss_coef, depending on the architecture) on the config in-place |
| final_hidden_states, router_logits | tuple[torch.Tensor, torch.Tensor] | Output of Qwen3OmniMoeThinkerTextSparseMoeBlock.forward: processed hidden states and router logits |
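The configure_moe contract in the tables above can be sketched as follows. This is a minimal illustration, not the module's code: the function name `configure_moe_sketch`, the `AUX_COEF_ATTR` mapping from `model_type` to attribute name, and the decision to set `output_router_logits` for every listed architecture are assumptions. Only the three attribute names and the `moe_aux_loss_coef` argument come from this page.

```python
from types import SimpleNamespace

# Hypothetical mapping from model_type to the config attribute that holds the
# auxiliary-loss coefficient; the real module's mapping may differ.
AUX_COEF_ATTR = {
    "mixtral": "router_aux_loss_coef",
    "deepseek": "aux_loss_alpha",
    "jetmoe": "aux_loss_coef",
}


def configure_moe_sketch(config, model_args, is_trainable):
    """Enable router logits and set the aux-loss coefficient in place."""
    coef = getattr(model_args, "moe_aux_loss_coef", None)
    if not is_trainable or coef is None:
        return  # inference, or no coefficient requested: leave config untouched

    attr = AUX_COEF_ATTR.get(getattr(config, "model_type", None))
    if attr is None:
        return  # architecture not covered by this sketch

    config.output_router_logits = True
    setattr(config, attr, coef)


# Stand-in objects (hypothetical) for PretrainedConfig and ModelArguments.
config = SimpleNamespace(model_type="mixtral")
model_args = SimpleNamespace(moe_aux_loss_coef=0.001)

configure_moe_sketch(config, model_args, is_trainable=True)
print(config.output_router_logits, config.router_aux_loss_coef)  # True 0.001
```

Keeping the attribute names in a lookup table makes the per-architecture differences explicit and keeps the function body free of branching on model type.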
Usage Examples
from llamafactory.model.model_utils.moe import add_z3_leaf_module, configure_moe

# `model`, `config`, and `model_args` are assumed to come from the regular loading pipeline.

# Register MoE leaf modules for DeepSpeed ZeRO-3
# (called automatically during model patching)
add_z3_leaf_module(model)

# Configure auxiliary loss for expert load balancing
# (called automatically during config patching)
configure_moe(config, model_args, is_trainable=True)

# Supported architectures for ZeRO-3 leaf registration:
# DBRX, DeepSeek V2/V3, Ernie4.5-MoE, GraniteMoE, GLM4-MoE,
# GLM4V-MoE, GPT-OSS, Jamba, JetMoE, Llama4, Mixtral, OLMoE,
# PhiMoE, Qwen2-MoE, Qwen3-MoE, Qwen3-VL-MoE, Qwen3-Omni-MoE,
# Kimi-VL, InternVL 3.5
Related Pages
- Hiyouga_LLaMA_Factory_Model_Loader - Invokes MoE configuration during the model loading pipeline
- Hiyouga_LLaMA_Factory_KTransformers_Integration - KTransformers provides alternative MoE model loading for CPU/GPU hybrid setups
- Hiyouga_LLaMA_Factory_Liger_Kernel - Liger Kernel provides complementary optimization for MoE models like Mixtral and Qwen3-MoE