Implementation:Hiyouga LLaMA Factory MoE Config

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Mixture of Experts, Distributed Training
Last Updated	2026-02-06 19:00 GMT

Overview

Configures Mixture-of-Experts models for correct operation with DeepSpeed ZeRO-3 partitioning and enables auxiliary loss-based expert load balancing during training.
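As background for the load-balancing part, the sketch below computes one common form of auxiliary loss (Switch Transformer style, top-1 routing): the number of experts times the sum, over experts, of the fraction of tokens routed to that expert multiplied by the mean router probability for it. It is an illustration only. Each architecture defines its own loss in its modeling code, and the function name, the top-1 routing, and the plain-list inputs are assumptions of this example.

```python
import math


def load_balancing_loss(router_logits, coef):
    """Switch-style auxiliary loss for top-1 routing (illustrative only).

    router_logits: list of per-token lists, one logit per expert.
    coef: scaling coefficient, playing the role of moe_aux_loss_coef.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])

    tokens_per_expert = [0.0] * num_experts  # token counts per expert
    prob_per_expert = [0.0] * num_experts    # summed router probabilities

    for logits in router_logits:
        # Numerically stable softmax over experts for this token.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Top-1 routing: the token counts toward its most probable expert.
        chosen = max(range(num_experts), key=lambda i: probs[i])
        tokens_per_expert[chosen] += 1.0
        for i, p in enumerate(probs):
            prob_per_expert[i] += p

    f = [c / num_tokens for c in tokens_per_expert]
    p = [s / num_tokens for s in prob_per_expert]
    return coef * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Tokens split evenly over two experts.
balanced = load_balancing_loss([[5.0, -5.0], [-5.0, 5.0]], coef=0.01)
# Every token routed to expert 0.
collapsed = load_balancing_loss([[5.0, -5.0], [5.0, -5.0]], coef=0.01)
```

With an even split the unscaled term is close to 1, while collapsing all tokens onto one expert pushes it toward the number of experts, so the collapsed case is penalized more.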

Description

This module provides two primary functions and a compatibility class for MoE model training.

add_z3_leaf_module registers MoE block classes as DeepSpeed ZeRO-3 leaf modules, which prevents ZeRO-3 from incorrectly partitioning expert weights across devices. It covers over 20 model architectures: DBRX, DeepSeek V2/V3, Ernie4.5-MoE, GraniteMoE, GLM4-MoE, GPT-OSS, Jamba, JetMoE, Llama4, Mixtral, OLMoE, PhiMoE, Qwen2-MoE, Qwen3-MoE, Qwen3-VL-MoE, Qwen3-Omni-MoE, and multimodal variants such as Kimi-VL and InternVL 3.5.

configure_moe enables router logit output and sets the auxiliary loss coefficient used for expert load balancing. The config attribute that holds the coefficient depends on the architecture: router_aux_loss_coef, aux_loss_alpha, or aux_loss_coef.

The module also includes Qwen3OmniMoeThinkerTextSparseMoeBlock, a patched MoE block that fixes DeepSpeed ZeRO-2 and FSDP2 compatibility by routing all tokens through all experts and applying the router weights as gates.
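The architecture-specific attribute names can be pictured as a small dispatch table. The following is a minimal sketch, not the module's code: configure_moe_sketch and AUX_COEF_ATTR are hypothetical names, the pairing of each model type with an attribute is an assumption, and setting output_router_logits for every listed type is a simplification.

```python
from types import SimpleNamespace

# Assumed mapping from model_type to the config attribute that holds the
# auxiliary loss coefficient. The attribute names come from the description;
# which architecture uses which name is an assumption of this sketch.
AUX_COEF_ATTR = {
    "mixtral": "router_aux_loss_coef",
    "deepseek": "aux_loss_alpha",
    "jetmoe": "aux_loss_coef",
}


def configure_moe_sketch(config, moe_aux_loss_coef, is_trainable):
    """Set router-logit output and the aux loss coefficient in place."""
    # Assumption: nothing to do for inference or when no coefficient is given.
    if not is_trainable or moe_aux_loss_coef is None:
        return

    attr = AUX_COEF_ATTR.get(getattr(config, "model_type", None))
    if attr is None:
        return  # unknown architecture: leave the config untouched

    config.output_router_logits = True
    setattr(config, attr, moe_aux_loss_coef)


# A trainable Mixtral-like config gets both settings.
mixtral_cfg = SimpleNamespace(model_type="mixtral")
configure_moe_sketch(mixtral_cfg, 0.001, is_trainable=True)

# An inference-only config is left as it was.
frozen_cfg = SimpleNamespace(model_type="jetmoe")
configure_moe_sketch(frozen_cfg, 0.001, is_trainable=False)
```

Keeping the mapping in one table means a new architecture only needs a new entry, which is one plausible reason for handling the attribute names centrally.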

Usage

Use add_z3_leaf_module when training MoE models with DeepSpeed ZeRO-3 to prevent weight partitioning issues. Use configure_moe when training with an auxiliary expert load balancing loss. In the normal LLaMA Factory pipeline neither needs to be called by hand: add_z3_leaf_module runs during model patching and configure_moe runs during config patching.
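To make the idea of a leaf module concrete, here is a toy model of it that does not use DeepSpeed or PyTorch. All class and function names are hypothetical, and the is_leaf flag is a stand-in for whatever marker the DeepSpeed API sets (DeepSpeed's own entry point for this is deepspeed.utils.set_z3_leaf_modules, which the sketch does not call). The point it illustrates is that traversal stops at a flagged module, so the block and everything under it are handled as one unit.

```python
class ToyModule:
    """Stand-in for nn.Module: a named node with child modules."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.is_leaf = False  # hypothetical flag, set by mark_leaf_modules


class ToySparseMoeBlock(ToyModule):
    """Stand-in for an architecture's MoE block class."""


def iter_modules(root):
    """Yield root and all of its descendants, depth first."""
    yield root
    for child in root.children:
        yield from iter_modules(child)


def mark_leaf_modules(root, leaf_classes):
    """Flag every module whose class is in leaf_classes; return the matches."""
    matched = [m for m in iter_modules(root) if isinstance(m, tuple(leaf_classes))]
    for module in matched:
        module.is_leaf = True
    return matched


def managed_units(root):
    """Names of modules handled individually; traversal stops at leaves."""
    units = [root.name]
    if not root.is_leaf:
        for child in root.children:
            units.extend(managed_units(child))
    return units


experts = [ToyModule(f"expert_{i}") for i in range(4)]
moe = ToySparseMoeBlock("moe_block", [ToyModule("router"), *experts])
model = ToyModule("decoder_layer", [ToyModule("attention"), moe])

before = managed_units(model)  # router and each expert appear separately
mark_leaf_modules(model, [ToySparseMoeBlock])
after = managed_units(model)   # the MoE block is now a single unit
```

Before marking, the router and each expert are visited on their own; afterwards the walk ends at moe_block, which is the behavior a leaf registration is meant to give the MoE block.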

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/model/model_utils/moe.py
Lines: 1-252

Signature

def _set_z3_leaf_modules(
    model: "PreTrainedModel",
    leaf_modules: list[Union["nn.Module", str]],
) -> None:
    ...

def add_z3_leaf_module(model: "PreTrainedModel") -> None:
    ...

def configure_moe(
    config: "PretrainedConfig",
    model_args: "ModelArguments",
    is_trainable: bool,
) -> None:
    ...

class Qwen3OmniMoeThinkerTextSparseMoeBlock(nn.Module):
    def __init__(self, config) -> None:
        ...
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        ...

Import

from llamafactory.model.model_utils.moe import add_z3_leaf_module, configure_moe

I/O Contract

Inputs

Name	Type	Required	Description
model	PreTrainedModel	Yes (for add_z3_leaf_module)	The MoE model to configure for ZeRO-3 leaf module registration
config	PretrainedConfig	Yes (for configure_moe)	Model configuration to modify with router logit and auxiliary loss settings
model_args	ModelArguments	Yes (for configure_moe)	Model arguments containing moe_aux_loss_coef for load balancing coefficient
is_trainable	bool	Yes (for configure_moe)	Whether the model is being loaded for training; when False, configure_moe leaves the config unchanged

Outputs

Name	Type	Description
(side effect)	None	add_z3_leaf_module registers MoE blocks as ZeRO-3 leaf modules via DeepSpeed API
(side effect)	None	configure_moe sets output_router_logits and router_aux_loss_coef (or equivalent) on the config in-place
final_hidden_states, router_logits	tuple[torch.Tensor, torch.Tensor]	Output of Qwen3OmniMoeThinkerTextSparseMoeBlock.forward: processed hidden states and router logits
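The phrase "routing all tokens through all experts with weight-based gating" can be read as follows: compute the usual top-k routing weights, spread them into a dense vector with zeros for unselected experts, and evaluate every expert. The sketch below shows that reading for a single token using plain Python lists. It is not the patched block's code; dense_moe_forward, the renormalization of the top-k weights, and the toy experts are assumptions made for illustration.

```python
import math


def dense_moe_forward(hidden, router_weights, experts, top_k):
    """Run one token through every expert, gated by dense routing weights.

    hidden: list[float], the token's hidden state.
    router_weights: one list[float] per expert (rows of a linear router).
    experts: list of callables mapping list[float] -> list[float].
    Returns (output, router_logits).
    """
    router_logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]

    # Softmax over experts.
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top-k probabilities, renormalize them, zero out the rest.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_mass = sum(probs[i] for i in top)
    gates = [probs[i] / top_mass if i in top else 0.0 for i in range(len(probs))]

    # Every expert is evaluated; unselected ones contribute with weight 0.
    output = [0.0] * len(hidden)
    for gate, expert in zip(gates, experts):
        expert_out = expert(hidden)
        output = [acc + gate * val for acc, val in zip(output, expert_out)]
    return output, router_logits


calls = []


def make_expert(index, scale):
    def expert(x):
        calls.append(index)  # record that this expert actually ran
        return [scale * v for v in x]
    return expert


toy_experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0])]
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out, logits = dense_moe_forward([2.0, 0.0], router, toy_experts, top_k=1)
```

All three toy experts run even though only the first one has a nonzero gate. A plausible motivation, stated here as an assumption, is that every rank then touches the same expert parameters on every step, at the cost of extra compute.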

Usage Examples

from llamafactory.model.model_utils.moe import add_z3_leaf_module, configure_moe

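# `model`, `config`, and `model_args` are assumed to exist already, e.g. a loaded
# PreTrainedModel, its PretrainedConfig, and the parsed ModelArguments.

# Register MoE blocks as DeepSpeed ZeRO-3 leaf modules
# (called automatically during model patching)
add_z3_leaf_module(model)

# Enable router logits and set the auxiliary loss coefficient for load balancing
# (called automatically during config patching; skipped when is_trainable is False)
configure_moe(config, model_args, is_trainable=True)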

# Supported architectures for ZeRO-3 leaf registration:
# DBRX, DeepSeek V2/V3, Ernie4.5-MoE, GraniteMoE, GLM4-MoE,
# GLM4V-MoE, GPT-OSS, Jamba, JetMoE, Llama4, Mixtral, OLMoE,
# PhiMoE, Qwen2-MoE, Qwen3-MoE, Qwen3-VL-MoE, Qwen3-Omni-MoE,
# Kimi-VL, InternVL 3.5

Related Pages

Hiyouga_LLaMA_Factory_Model_Loader - Invokes MoE configuration during the model loading pipeline
Hiyouga_LLaMA_Factory_KTransformers_Integration - KTransformers provides alternative MoE model loading for CPU/GPU hybrid
Hiyouga_LLaMA_Factory_Liger_Kernel - Liger Kernel provides complementary optimization for MoE models like Mixtral and Qwen3-MoE
