Implementation:Hiyouga LLaMA Factory NPU Fused MoE

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Hardware Acceleration, NPU
Last Updated	2026-02-06 19:00 GMT

Overview

An NPU-optimized fused Mixture-of-Experts (MoE) kernel that replaces the standard MoE forward passes with grouped matrix multiplication operations native to Huawei Ascend NPUs in order to improve performance.

Description

npu_fused_moe.py implements hardware-accelerated MoE computation for Huawei NPU (Ascend) devices. The module provides:

GmmFunction: A custom PyTorch autograd function implementing grouped matrix multiplication (GMM) using torch_npu.npu_grouped_matmul. Supports forward and backward passes with proper gradient computation for training.
HybridGmmFunction: An alternative autograd function for hybrid grouped matrix multiplication that processes per-expert input/weight lists without requiring a cumulative group_list. Used specifically for Qwen3 MoE architectures where experts have individual weight matrices.
NpuMoeFused: A container class with static methods implementing NPU-fused forward passes for standard MoE architectures:
- npu_moe_experts_forward: Replaces the experts' forward using npu_moe_token_permute, GmmFunction-based grouped matmul with SwiGLU activation, and npu_moe_token_unpermute.
- npu_moe_sparse_block_forward: Replaces the sparse MoE block forward with gate computation, top-k routing, and fused expert execution.
Qwen3NpuMoeFused: Specialized container for Qwen3 MoE architectures using HybridGmmFunction for per-expert computation with SiLU activation.
NpuFusedMoEKernel: The registered kernel class (kernel_id: "npu_fused_moe") that applies MoE patches based on model architecture. Uses a kernel_moe_mapping configuration to map model architecture names to their corresponding patched forward functions.
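
To make the grouped matrix multiplication described above concrete, the following is a minimal reference sketch in plain Python. It is not the NPU kernel and does not call torch_npu; the function names and the assumption that groups are given as per-expert row counts are illustrative only.

```python
def matmul(a, b):
    """Multiply a (m x k) by b (k x n), both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_matmul_reference(x, weights, group_sizes):
    """Apply weights[e] to the e-th contiguous block of rows in x.

    x           : rows already sorted by expert, shape (tokens, in_dim)
    weights     : one (in_dim x out_dim) matrix per expert
    group_sizes : rows owned by each expert (assumed here: counts, not cumulative)
    """
    assert sum(group_sizes) == len(x), "group sizes must cover every row"
    out, start = [], 0
    for w, size in zip(weights, group_sizes):
        out.extend(matmul(x[start:start + size], w))
        start += size
    return out


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles its input
]
result = grouped_matmul_reference(x, weights, group_sizes=[1, 2])
print(result)
```

A fused kernel performs the same per-group products in a single call instead of a Python loop over experts, which is the point of replacing the standard forward pass.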

The module supports Qwen3MoE, Qwen3VLMoE, and similar MoE architectures, and adjusts its patches to the installed transformers version.
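
The component list mentions both SwiGLU and SiLU activations. As a reminder of how the two relate, here is a small sketch of the common formulation, in which SwiGLU multiplies SiLU(gate) elementwise with the up projection. Whether the fused kernel lays out its gate and up projections in exactly this way is not stated on this page, so the variable names below are assumptions.

```python
import math


def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up):
    """Elementwise SiLU(gate) * up for two equally sized vectors."""
    return [silu(g) * u for g, u in zip(gate, up)]


gate_proj_out = [0.0, 1.0, -2.0]  # hypothetical gate projection output
up_proj_out = [3.0, 2.0, 1.0]     # hypothetical up projection output
activated = swiglu(gate_proj_out, up_proj_out)
print(activated)
```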

Usage

This kernel is automatically discovered and registered by the kernel interface's scan_all_kernels function. It is applied when running MoE models on NPU hardware with kernels enabled. No manual invocation is needed in typical usage.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/plugins/model_plugins/kernels/ops/mlp/npu_fused_moe.py
Lines: 1-343

Signature

class GmmFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, group_list) -> Tensor
    @staticmethod
    def backward(ctx, grad_output) -> tuple

class HybridGmmFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, num_experts, *args) -> tuple
    @staticmethod
    def backward(ctx, *grad_outputs) -> tuple

class NpuMoeFused:
    @staticmethod
    def npu_moe_experts_forward(self, hidden_states, routing_weights, router_indices) -> torch.Tensor
    @staticmethod
    def npu_moe_sparse_block_forward(self, hidden_states) -> torch.Tensor

class Qwen3NpuMoeFused:
    @staticmethod
    def qwen3moe_sparse_moe_block_forward(self, hidden_states) -> tuple

@register_kernel
class NpuFusedMoEKernel(BaseKernel):
    _kernel_id = "npu_fused_moe"
    _device = DeviceType.NPU
    @classmethod
    def apply(cls, **kwargs) -> HFModel

Import

from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_fused_moe import NpuFusedMoEKernel

I/O Contract

Inputs

NpuFusedMoEKernel.apply

Name	Type	Required	Description
model	HFModel (via kwargs)	Yes	The HuggingFace model instance; must have a config with architectures attribute

GmmFunction.forward

Name	Type	Required	Description
x	torch.Tensor	Yes	Input tensor for grouped matrix multiplication
weight	torch.Tensor	Yes	Weight tensor for grouped matrix multiplication
group_list	list	Yes	List of group sizes defining the token-to-expert assignment
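
A group list can be expressed either as per-expert counts or as cumulative end offsets. This page does not pin down which form GmmFunction expects, so the sketch below simply shows how both can be derived from per-token expert indices. The helper name and sample data are illustrative assumptions.

```python
from itertools import accumulate


def group_sizes_from_indices(expert_indices, num_experts):
    """Count how many routed tokens each expert receives."""
    counts = [0] * num_experts
    for expert in expert_indices:
        counts[expert] += 1
    return counts


# One expert index per routed token copy (hypothetical data).
expert_indices = [2, 0, 2, 1, 2, 0]
sizes = group_sizes_from_indices(expert_indices, num_experts=4)
cumulative = list(accumulate(sizes))  # end offset of each expert's block

print(sizes)
print(cumulative)
```

Experts that receive no tokens appear as a zero count, or as a repeated offset in the cumulative form.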

NpuMoeFused.npu_moe_experts_forward

Name	Type	Required	Description
hidden_states	torch.Tensor	Yes	Input hidden states from the transformer layer
routing_weights	torch.Tensor	Yes	Softmax-normalized routing weights from the gate
router_indices	torch.Tensor	Yes	Top-k expert indices for each token
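
The two routing inputs in the table above are typically produced by a softmax over gate logits followed by a top-k selection. The following plain-Python sketch shows that step for a single token. The `renormalize` flag is an assumption: some MoE models rescale the selected weights to sum to one, and this page does not say whether the patched forward does.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k, renormalize=True):
    """Return (routing_weights, router_indices) for one token."""
    probs = softmax(gate_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    indices = order[:k]
    weights = [probs[i] for i in indices]
    if renormalize:  # assumption: rescale the kept weights to sum to 1
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return weights, indices


weights, indices = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
print(weights, indices)
```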

Outputs

NpuFusedMoEKernel.apply

Name	Type	Description
model	HFModel	The model with MoE forward methods monkey-patched to use NPU fused operations
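
Monkey-patching by architecture name, as described for NpuFusedMoEKernel, can be sketched as a dictionary lookup followed by attribute replacement. All names below are hypothetical, and the shape assumed for the mapping (architecture name to a dict of class and replacement forward) is a guess made for illustration, not the actual kernel_moe_mapping layout.

```python
from types import SimpleNamespace


class DemoSparseMoeBlock:
    """Hypothetical stand-in for a model's sparse MoE block."""

    def forward(self, hidden_states):
        return ("original", hidden_states)


def demo_fused_forward(self, hidden_states):
    """Hypothetical replacement; a real one would call the fused NPU ops."""
    return ("fused", hidden_states)


# Assumed shape: architecture name -> {class to patch: new forward}.
demo_moe_mapping = {
    "DemoMoeForCausalLM": {DemoSparseMoeBlock: demo_fused_forward},
}


def apply_demo_patches(model):
    """Patch forward methods for the model's architecture, if supported."""
    architectures = getattr(model.config, "architectures", None) or []
    for arch in architectures:
        for block_cls, new_forward in demo_moe_mapping.get(arch, {}).items():
            block_cls.forward = new_forward
    return model


model = SimpleNamespace(
    config=SimpleNamespace(architectures=["DemoMoeForCausalLM"]),
    block=DemoSparseMoeBlock(),
)
before = model.block.forward("h")
apply_demo_patches(model)
after = model.block.forward("h")
print(before, after)
```

Since the replacement is installed as a plain function on the class, it receives the block instance as `self`, which may be why the patched forwards in the Signature section take `self` as their first parameter.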

NpuMoeFused.npu_moe_experts_forward

Name	Type	Description
next_states	torch.Tensor	Output tensor after expert computation and token unpermutation, shape (batch_size, seq_len, hidden_size)
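
The permute and unpermute steps around the expert computation can be sketched as a stable sort by expert followed by a weighted scatter back to the original token order. The code below is a plain-Python illustration of that idea only. It does not reproduce the semantics of npu_moe_token_permute or npu_moe_token_unpermute, and its helper names are hypothetical.

```python
def permute_tokens(tokens, router_indices):
    """Group routed token copies by expert and keep the bookkeeping to undo it."""
    # One (expert, token_id, slot) entry per routed copy of a token.
    pairs = [
        (expert, token_id, slot)
        for token_id, experts in enumerate(router_indices)
        for slot, expert in enumerate(experts)
    ]
    pairs.sort(key=lambda p: p[0])  # stable: keeps token order within an expert
    permuted = [tokens[token_id] for _, token_id, _ in pairs]
    return permuted, pairs


def unpermute_tokens(expert_out, pairs, routing_weights, num_tokens):
    """Scatter expert outputs back to token order, weighted by the router."""
    dim = len(expert_out[0])
    out = [[0.0] * dim for _ in range(num_tokens)]
    for row, (_, token_id, slot) in zip(expert_out, pairs):
        weight = routing_weights[token_id][slot]
        for d in range(dim):
            out[token_id][d] += weight * row[d]
    return out


tokens = [[1.0, 1.0], [2.0, 2.0]]
router_indices = [[1, 0], [0, 1]]              # top-2 experts per token
routing_weights = [[0.75, 0.25], [0.5, 0.5]]

permuted, pairs = permute_tokens(tokens, router_indices)
# Stand-in for the expert computation: expert e scales its rows by (e + 1).
expert_out = [[(e + 1) * v for v in row] for row, (e, _, _) in zip(permuted, pairs)]
combined = unpermute_tokens(expert_out, pairs, routing_weights, num_tokens=len(tokens))
print(combined)
```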

Usage Examples

# Application by kernel ID through the kernel interface
# (`model` is an already-loaded HuggingFace MoE model)
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel

apply_kernel("npu_fused_moe", model=model)

# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_fused_moe import NpuFusedMoEKernel

NpuFusedMoEKernel.apply(model=model)

Related Pages

Hiyouga_LLaMA_Factory_Kernel_Interface - Kernel discovery and registration interface that manages this kernel
Hiyouga_LLaMA_Factory_NPU_SwiGLU - Related NPU SwiGLU kernel for non-MoE MLP layers
