Principle:Hiyouga LLaMA Factory Mixture of Experts
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Model Architecture |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Mixture of Experts (MoE) is a sparse architecture paradigm where each input token is routed to a subset of specialized expert networks, enabling models to scale to trillions of parameters while maintaining tractable computation per forward pass.
Description
Traditional dense transformer models activate all parameters for every input token. MoE architectures replace the standard feed-forward network (FFN) in each transformer layer with multiple parallel expert networks and a gating (router) network that selects which experts process each token. Only the top-$k$ experts are activated per token, making the computational cost proportional to $k$ rather than to the total number of experts $N$.
This sparsity allows MoE models to have a very large total parameter count (contributing to capacity) while keeping per-token FLOPs comparable to a much smaller dense model. For example, Mixtral 8x7B has 8 experts per layer but activates only 2 of them per token, giving it roughly 47B total parameters with the per-token inference cost of an approximately 13B dense model.
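The gap between total and active parameters can be checked with simple arithmetic. The sketch below counts only expert FFN weights, assuming a gated FFN with three projection matrices per expert and the publicly documented Mixtral 8x7B dimensions (hidden size 4096, intermediate size 14336, 32 layers). Attention, embedding, and router weights are ignored, so both numbers come out lower than the headline figures. The function name is illustrative and not part of LLaMA-Factory.

```python
def expert_ffn_params(hidden_size, intermediate_size, num_layers, num_experts, top_k):
    """Count expert FFN weights (total, active per token) for a gated-FFN MoE."""
    # Assumption: each expert has gate, up, and down projections without biases.
    per_expert = 3 * hidden_size * intermediate_size
    total = per_expert * num_experts * num_layers
    active = per_expert * top_k * num_layers
    return total, active


total, active = expert_ffn_params(
    hidden_size=4096, intermediate_size=14336, num_layers=32, num_experts=8, top_k=2
)
print(f"expert FFN weights, total:  {total / 1e9:.1f}B")
print(f"expert FFN weights, active: {active / 1e9:.1f}B")
print(f"total / active = {total // active}")
```

For the expert weights alone, the ratio of total to active parameters is simply the number of experts divided by the number of experts used per token.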
LLaMA-Factory provides comprehensive MoE support through:
- Router auxiliary loss configuration (configure_moe), which enables the load-balancing loss that prevents expert collapse (most tokens being routed to the same few experts). The auxiliary loss coefficient is configurable via moe_aux_loss_coef.
- DeepSpeed ZeRO-3 leaf module registration (add_z3_leaf_module), which marks MoE blocks as leaf modules in DeepSpeed's partitioning scheme. ZeRO-3 then gathers the parameters of all experts in a block when the block is entered, instead of handling each expert separately. This matters because routing changes which experts run from step to step, which can otherwise disrupt ZeRO-3's parameter fetching.
- NPU-optimized MoE kernels that use fused operations including npu_moe_token_permute, grouped matrix multiplication (npu_grouped_matmul), and npu_moe_token_unpermute for efficient expert dispatch and combination on Ascend hardware.
The framework supports a wide range of MoE architectures including Mixtral, DBRX, DeepSeek-V2/V3, Qwen2-MoE, Qwen3-MoE, JetMoE, Jamba, OLMoE, PhiMoE, GraniteMoE, Llama-4, GLM4-MoE, and Ernie4.5-MoE.
Usage
MoE configuration is relevant when:
- Fine-tuning MoE-based models where load balancing is critical for training stability. Set moe_aux_loss_coef to a small positive value (e.g., 0.01) to enable the auxiliary loss.
- Using DeepSpeed ZeRO-3 for distributed training of MoE models, where leaf module registration is needed so that each MoE block's parameters are fetched as a unit.
- Deploying on NPU hardware where fused MoE kernels can significantly accelerate expert dispatch and computation.
Theoretical Basis
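The MoE layer replaces the standard FFN with a gated combination of expert networks. For an input token $x$, the output is:
$$y = \sum_{i=1}^{N} g_i(x)\, E_i(x)$$
where $N$ is the number of experts, $E_i$ is the $i$-th expert network, and $g_i(x)$ is the gating weight for expert $i$.
The gating function produces routing weights via a softmax over a linear projection with router weights $W_g$:
$$p(x) = \mathrm{softmax}(W_g\, x)$$
Only the top-$k$ experts are selected, and the remaining gates are set to zero:
$$g_i(x) = \begin{cases} p_i(x) & \text{if } i \in \mathrm{TopK}(p(x), k) \\ 0 & \text{otherwise} \end{cases}$$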
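The routing rule can be sketched in plain Python for a single token. This is a minimal illustration, not the framework's implementation: the router logits are passed in directly instead of being computed from a learned projection, the experts are toy functions, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_gates(logits, k):
    """Gate weights: softmax probabilities for the top-k experts, zero elsewhere."""
    probs = softmax(logits)
    # Indices of the k largest probabilities.
    top = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    return [probs[i] if i in top else 0.0 for i in range(len(probs))]


def moe_output(x, experts, router_logits, k):
    """Gate-weighted sum of the selected experts' outputs for one token vector x."""
    gates = top_k_gates(router_logits, k)
    out = [0.0] * len(x)
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # unselected experts are never evaluated
        y = expert(x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
router_logits = [2.0, 0.5, 1.0, -1.0]
gates = top_k_gates(router_logits, k=2)
y = moe_output([1.0, -2.0], experts, router_logits, k=2)
print(gates)
print(y)
```

Only experts 0 and 2 are evaluated here, which is where the compute saving comes from. Some implementations, Mixtral among them, additionally renormalize the selected gates so that they sum to one; that step is omitted to match the description above.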
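The auxiliary load-balancing loss prevents router collapse:
$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \sum_{i=1}^{N} f_i \, P_i$$
where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ is the mean routing probability for expert $i$; $N$ is the number of experts and $\alpha$ is the auxiliary loss coefficient (moe_aux_loss_coef). Minimizing this loss encourages uniform token distribution across experts.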
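To make the effect of the loss concrete, the sketch below evaluates it for a balanced and a collapsed router on a toy batch. It is an illustration under stated assumptions rather than the code used by any particular model: the fraction of tokens per expert is normalized by the total number of assignments (tokens times k) so that the fractions sum to one, and the function name and default coefficient are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def load_balancing_loss(router_logits, k, coef=0.01):
    """coef * N * sum_i f_i * P_i for a batch of per-token router logits."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    # Assumption: f_i is normalized by total assignments so that sum(f) == 1.
    f = [c / (num_tokens * k) for c in counts]
    # P_i is the routing probability of expert i averaged over all tokens.
    P = [s / num_tokens for s in prob_sums]
    return coef * num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Balanced: each of 4 tokens prefers a different expert.
balanced = [[4.0 if i == t else 0.0 for i in range(4)] for t in range(4)]
# Collapsed: every token prefers expert 0.
collapsed = [[4.0, 0.0, 0.0, 0.0] for _ in range(4)]
loss_balanced = load_balancing_loss(balanced, k=1)
loss_collapsed = load_balancing_loss(collapsed, k=1)
print(loss_balanced, loss_collapsed)
```

With this normalization a perfectly balanced router yields a loss equal to the coefficient itself, and the collapsed router is penalized several times more heavily.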
In the codebase, the auxiliary loss is enabled by setting output_router_logits=True on the model configuration:
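def configure_moe(config, model_args, is_trainable):
    # Simplified excerpt; the list of supported model types is abbreviated.
    model_type = getattr(config, "model_type", None)
    if model_type in ["mixtral", "qwen2_moe", ...]:
        # Return router logits so the load-balancing loss can be computed.
        setattr(config, "output_router_logits", True)
        # Weight of the auxiliary loss in the total training loss.
        setattr(config, "router_aux_loss_coef", model_args.moe_aux_loss_coef)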
For DeepSpeed ZeRO-3 compatibility, MoE blocks are registered as leaf modules so that the parameters of all experts in a block are gathered together when the block runs:
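def add_z3_leaf_module(model):
    # Simplified excerpt showing only the Mixtral case.
    from deepspeed.utils import set_z3_leaf_modules
    model_type = getattr(model.config, "model_type", None)
    if model_type == "mixtral":
        from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
        set_z3_leaf_modules(model, [MixtralSparseMoeBlock])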
The NPU-optimized implementation uses fused token permutation and grouped matrix multiplication to avoid the overhead of dispatching tokens to individual experts sequentially:
# Permute tokens according to expert assignment
permuted_states, row_ids_map = torch_npu.npu_moe_token_permute(
    hidden_states, router_indices.to(torch.int32)
)
# Grouped matmul across all experts simultaneously
intermediate = GmmFunction.apply(permuted_states, self.gate_up_proj, tokens_per_expert)
activated = torch_npu.npu_swiglu(intermediate, dim=-1)
output = GmmFunction.apply(activated, self.down_proj, tokens_per_expert)
# Unpermute and weight by routing probabilities
result = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=routing_weights)
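What the permute, grouped compute, and unpermute steps accomplish can be shown with a plain-Python reference. This is a conceptual sketch only: it does not call torch_npu, the helper names are hypothetical, and the experts are toy scaling functions standing in for the grouped matrix multiplications.

```python
def permute_tokens(tokens, expert_ids, num_experts):
    """Sort (token, slot) pairs by expert so each expert's inputs are contiguous."""
    pairs = sorted(
        (e, t, s) for t, ids in enumerate(expert_ids) for s, e in enumerate(ids)
    )
    permuted = [tokens[t] for _, t, _ in pairs]
    row_map = [(t, s) for _, t, s in pairs]  # where each row came from
    tokens_per_expert = [0] * num_experts
    for e, _, _ in pairs:
        tokens_per_expert[e] += 1
    return permuted, row_map, tokens_per_expert


def grouped_apply(rows, experts, tokens_per_expert):
    """Apply expert i to its contiguous slice of rows (stand-in for a grouped matmul)."""
    out, start = [], 0
    for expert, n in zip(experts, tokens_per_expert):
        out.extend(expert(row) for row in rows[start:start + n])
        start += n
    return out


def unpermute_tokens(rows, row_map, weights, num_tokens):
    """Scatter rows back to token order, summing the k slots by routing weight."""
    dim = len(rows[0])
    result = [[0.0] * dim for _ in range(num_tokens)]
    for row, (t, s) in zip(rows, row_map):
        w = weights[t][s]
        result[t] = [r + w * v for r, v in zip(result[t], row)]
    return result


# Toy setup: 3 tokens, top-2 routing over 3 experts that scale by 1, 10, 100.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expert_ids = [[0, 2], [1, 0], [2, 1]]
weights = [[0.75, 0.25], [0.5, 0.5], [1.0, 0.0]]
experts = [lambda row, s=s: [s * v for v in row] for s in (1.0, 10.0, 100.0)]

permuted, row_map, tokens_per_expert = permute_tokens(tokens, expert_ids, num_experts=3)
expert_out = grouped_apply(permuted, experts, tokens_per_expert)
result = unpermute_tokens(expert_out, row_map, weights, num_tokens=len(tokens))
print(tokens_per_expert)
print(result)
```

Sorting by expert first means each expert processes one contiguous slice, which is the layout a grouped matrix multiplication needs in order to handle all experts in a single call instead of looping over them.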