Principle:Hiyouga LLaMA Factory Mixture of Experts

Knowledge Sources	Hiyouga_LLaMA_Factory; Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Domains	Deep Learning, Model Architecture
Last Updated	2026-02-06 19:00 GMT

Overview

Mixture of Experts (MoE) is a sparse architecture paradigm where each input token is routed to a subset of specialized expert networks, enabling models to scale to trillions of parameters while maintaining tractable computation per forward pass.

Description

Traditional dense transformer models activate all parameters for every input token. MoE architectures replace the standard feed-forward network (FFN) in each transformer layer with multiple parallel expert networks and a gating (router) network that selects which experts process each token. Only the top-$k$ experts are activated per token, so the per-token FFN cost scales with $k$ rather than with the total number of experts.

This sparsity allows MoE models to have a very large total parameter count (contributing to capacity) while keeping per-token FLOPs comparable to a much smaller dense model. For example, Mixtral 8x7B has 8 experts per layer but only activates 2, giving it the capacity of a ~47B parameter model with the inference cost of approximately a 13B model.
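
The Mixtral figures above can be reproduced with a short back-of-the-envelope count. The sketch below is illustrative only: the hyperparameters passed in are assumptions taken from the publicly released Mixtral-8x7B configuration rather than from this page, and `moe_param_counts` is a hypothetical helper, not part of LLaMA-Factory.

```python
def moe_param_counts(hidden, intermediate, layers, experts, top_k,
                     vocab, q_heads, kv_heads, head_dim):
    """Return (total, active-per-token) parameter counts for a Mixtral-style model."""
    expert = 3 * hidden * intermediate           # gate, up and down projections
    attn = 2 * hidden * q_heads * head_dim       # query and output projections
    attn += 2 * hidden * kv_heads * head_dim     # key and value projections (GQA)
    router = hidden * experts                    # linear gate W_g
    norms = 2 * hidden                           # two RMSNorm weights per layer
    # Everything except the expert FFNs runs for every token.
    shared = layers * (attn + router + norms) + 2 * vocab * hidden + hidden
    total = shared + layers * experts * expert
    active = shared + layers * top_k * expert
    return total, active


# Assumed Mixtral-8x7B hyperparameters (not stated on this page).
total, active = moe_param_counts(
    hidden=4096, intermediate=14336, layers=32, experts=8, top_k=2,
    vocab=32000, q_heads=32, kv_heads=8, head_dim=128,
)
print(f"total  ~ {total / 1e9:.1f}B")   # total  ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")  # active ~ 12.9B
```

Almost all of the gap between the two numbers comes from the expert FFNs: attention, embeddings, and the router are shared by every token, so only the expert term shrinks from 8 to 2 copies per layer.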

LLaMA-Factory provides comprehensive MoE support through:

Router auxiliary loss configuration (configure_moe), which enables the load-balancing loss that prevents expert collapse (all tokens routed to the same few experts). The auxiliary loss coefficient is configurable via moe_aux_loss_coef.
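DeepSpeed ZeRO-3 leaf module registration (add_z3_leaf_module), which marks MoE blocks as leaf modules. ZeRO-3 partitions parameters across devices and gathers them on demand as each submodule runs; because the set of active experts varies between steps and between ranks, gathering expert by expert can fall out of sync. Treating the whole MoE block as a leaf makes ZeRO-3 gather all of the block's parameters together when the block is entered.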
NPU-optimized MoE kernels that use fused operations including npu_moe_token_permute, grouped matrix multiplication (npu_grouped_matmul), and npu_moe_token_unpermute for efficient expert dispatch and combination on Ascend hardware.

The framework supports a wide range of MoE architectures including Mixtral, DBRX, DeepSeek-V2/V3, Qwen2-MoE, Qwen3-MoE, JetMoE, Jamba, OLMoE, PhiMoE, GraniteMoE, Llama-4, GLM4-MoE, and Ernie4.5-MoE.

Usage

MoE configuration is relevant when:

Fine-tuning MoE-based models where load balancing is critical for training stability. Set moe_aux_loss_coef to a small positive value (e.g., 0.01) to enable the auxiliary loss.
Using DeepSpeed ZeRO-3 for distributed training of MoE models, where leaf module registration is needed so that each MoE block's parameters are gathered as a unit.
Deploying on NPU hardware where fused MoE kernels can significantly accelerate expert dispatch and computation.

Theoretical Basis

The MoE layer replaces the standard FFN with a gated combination of $N$ expert networks. For an input token $x$, the output is:

$y = \sum_{i = 1}^{N} g_{i} (x) \cdot E_{i} (x)$

where $E_{i}$ is the $i$-th expert network and $g_{i}(x)$ is the gating weight for expert $i$.

The gating function produces routing weights via a softmax over a linear projection:

$g(x) = \operatorname{softmax}(W_{g} \cdot x)$

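Only the top-$k$ experts are selected; their weights are renormalized to sum to one, and the remaining gates are set to zero:

$g_{i}(x) = \begin{cases} \dfrac{\operatorname{softmax}(W_{g} \cdot x)_{i}}{\sum_{j \in \operatorname{TopK}(g(x), k)} \operatorname{softmax}(W_{g} \cdot x)_{j}} & \text{if } i \in \operatorname{TopK}(g(x), k) \\ 0 & \text{otherwise} \end{cases}$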
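
The gating equations can be traced on a toy example. The following is a minimal sketch in plain Python (no tensor library) and is not the implementation used by any of the supported models; the function names, the toy experts, and the router matrix are all made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits, k):
    """One gate per expert: renormalized softmax for the top-k, 0 elsewhere."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    gates = [0.0] * len(probs)
    for i in top:
        gates[i] = probs[i] / norm
    return gates


def moe_forward(x, router_weight, experts, k):
    """y = sum_i g_i(x) * E_i(x), evaluating only the selected experts."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weight]
    gates = top_k_gates(logits, k)
    y = [0.0] * len(x)
    for gate, expert in zip(gates, experts):
        if gate > 0.0:  # unselected experts are never evaluated
            y = [acc + gate * out for acc, out in zip(y, expert(x))]
    return y


# Toy setup (hypothetical): 4 experts that scale the input by different factors.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weight = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]  # W_g, shape (N, d)

gates = top_k_gates([2.0, 1.0, 0.0, -1.0], k=2)
y = moe_forward([1.0, 0.5], router_weight, experts, k=2)
print(gates)  # two non-zero gates that sum to 1, two zeros
print(y)
```

With $k = 2$, only the first two experts run for this token, which is where the compute saving comes from: the other two experts contribute neither FLOPs nor output.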

The auxiliary load-balancing loss prevents router collapse:

$\mathcal{L}_{aux} = N \cdot \sum_{i = 1}^{N} f_{i} \cdot P_{i}$

where $f_{i}$ is the fraction of tokens routed to expert $i$ and $P_{i}$ is the mean routing probability for expert $i$. Minimizing this loss encourages uniform token distribution across experts.
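
A small numeric check shows how this loss separates balanced from collapsed routing. The sketch below is a plain-Python illustration, not the loss code of any particular library. One assumption to flag: $f_{i}$ is computed as the fraction of routing slots (tokens times $k$) assigned to expert $i$, so that it sums to one; implementations differ on this normalization.

```python
def load_balancing_loss(router_probs, k):
    """N * sum_i f_i * P_i for a batch of per-token routing distributions.

    router_probs: list of per-token probability lists, each summing to 1.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    # Assumption: normalize by tokens * k so that f sums to 1.
    f = [c / (num_tokens * k) for c in counts]
    P = [s / num_tokens for s in prob_sums]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Balanced: each of 4 tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

balanced_loss = load_balancing_loss(balanced, k=1)
collapsed_loss = load_balancing_loss(collapsed, k=1)
print(balanced_loss, collapsed_loss)  # roughly 1.0 versus 2.8
```

Under this normalization the balanced batch sits at the value 1, while the collapsed batch is penalized in proportion to how much probability mass the overloaded expert receives.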

In the codebase, the auxiliary loss is enabled by setting output_router_logits=True on the model configuration:

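def configure_moe(config, model_args, is_trainable):
    # Abridged excerpt; the list of model types is truncated here.
    model_type = getattr(config, "model_type", None)
    if model_type in ["mixtral", "qwen2_moe", ...]:
        # Return router logits so the model can compute the auxiliary loss.
        setattr(config, "output_router_logits", True)
        setattr(config, "router_aux_loss_coef", model_args.moe_aux_loss_coef)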
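
The effect of this pattern can be checked without loading a real model. The sketch below uses `types.SimpleNamespace` objects as stand-ins for the model configuration and the model arguments; `configure_moe_sketch` and `MOE_MODEL_TYPES` are hypothetical names, and the set of model types is deliberately abbreviated.

```python
from types import SimpleNamespace

# Hypothetical, abbreviated set of model types; the real list is longer.
MOE_MODEL_TYPES = {"mixtral", "qwen2_moe"}


def configure_moe_sketch(config, model_args):
    """Enable router logits and set the aux-loss coefficient on MoE configs only."""
    if getattr(config, "model_type", None) in MOE_MODEL_TYPES:
        setattr(config, "output_router_logits", True)
        setattr(config, "router_aux_loss_coef", model_args.moe_aux_loss_coef)
    return config


args = SimpleNamespace(moe_aux_loss_coef=0.01)
moe_config = configure_moe_sketch(SimpleNamespace(model_type="mixtral"), args)
dense_config = configure_moe_sketch(SimpleNamespace(model_type="llama"), args)

print(moe_config)    # gains output_router_logits and router_aux_loss_coef
print(dense_config)  # left unchanged
```

Dense model configurations pass through untouched, so the same entry point can be called for every model type.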

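For DeepSpeed ZeRO-3 compatibility, MoE blocks are registered as leaf modules so that ZeRO-3 gathers all parameters of a block together on entry rather than expert by expert, since the set of active experts can differ across ranks and steps:

def add_z3_leaf_module(model):
    from deepspeed.utils import set_z3_leaf_modules

    model_type = getattr(model.config, "model_type", None)
    if model_type == "mixtral":
        from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
        # Gather the parameters of the whole sparse MoE block as one unit.
        set_z3_leaf_modules(model, [MixtralSparseMoeBlock])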

The NPU-optimized implementation uses fused token permutation and grouped matrix multiplication to avoid the overhead of dispatching tokens to individual experts sequentially:

# Permute tokens according to expert assignment
permuted_states, row_ids_map = torch_npu.npu_moe_token_permute(
    hidden_states, router_indices.to(torch.int32)
)
# Grouped matmul across all experts simultaneously
intermediate = GmmFunction.apply(permuted_states, self.gate_up_proj, tokens_per_expert)
activated = torch_npu.npu_swiglu(intermediate, dim=-1)
output = GmmFunction.apply(activated, self.down_proj, tokens_per_expert)
# Unpermute and weight by routing probabilities
result = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=routing_weights)
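
The permute, grouped-compute, unpermute pattern can be followed in plain Python. The sketch below is a conceptual reference only and does not reproduce the `torch_npu` kernels or their signatures: all function names are hypothetical, and it assumes top-1 routing so that each token appears exactly once (with top-$k$ routing, each token row would be duplicated $k$ times before sorting and the weighted copies summed afterwards).

```python
def permute_by_expert(tokens, expert_ids, num_experts):
    """Reorder token rows so that rows for the same expert are contiguous."""
    order = sorted(range(len(tokens)), key=lambda t: expert_ids[t])  # stable sort
    permuted = [tokens[t] for t in order]
    tokens_per_expert = [0] * num_experts
    for e in expert_ids:
        tokens_per_expert[e] += 1
    return permuted, order, tokens_per_expert


def grouped_apply(permuted, tokens_per_expert, experts):
    """Apply each expert to its contiguous slice of the permuted rows."""
    outputs, start = [], 0
    for expert, count in zip(experts, tokens_per_expert):
        outputs.extend(expert(row) for row in permuted[start:start + count])
        start += count
    return outputs


def unpermute(outputs, order, weights):
    """Scatter rows back to token order and scale by the routing weight."""
    result = [None] * len(outputs)
    for pos, t in enumerate(order):
        result[t] = [weights[t] * v for v in outputs[pos]]
    return result


# Toy data (hypothetical): 4 one-dimensional tokens, 2 experts, top-1 routing.
tokens = [[1.0], [2.0], [3.0], [4.0]]
expert_ids = [1, 0, 1, 0]
weights = [1.0, 0.5, 1.0, 1.0]
experts = [
    lambda row: [10.0 * v for v in row],  # expert 0 scales by 10
    lambda row: [v + 1.0 for v in row],   # expert 1 adds 1
]

permuted, order, tokens_per_expert = permute_by_expert(tokens, expert_ids, 2)
outputs = grouped_apply(permuted, tokens_per_expert, experts)
result = unpermute(outputs, order, weights)
print(result)  # [[2.0], [10.0], [4.0], [40.0]]
```

Sorting tokens by expert turns many small per-expert operations into one pass over contiguous groups, which is the property a grouped matrix multiplication exploits.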
