

Implementation:Hiyouga LLaMA Factory NPU Fused MoE



Knowledge Sources
Domains: Machine Learning, Hardware Acceleration, NPU
Last Updated: 2026-02-06 19:00 GMT

Overview

NPU-optimized fused Mixture-of-Experts (MoE) kernel that replaces the standard MoE forward pass with grouped matrix multiplication operations native to Huawei Ascend NPUs in order to improve performance.

Description

npu_fused_moe.py implements hardware-accelerated MoE computation for Huawei NPU (Ascend) devices. The module provides:

  • GmmFunction: A custom PyTorch autograd function implementing grouped matrix multiplication (GMM) using torch_npu.npu_grouped_matmul. Supports forward and backward passes with proper gradient computation for training.
  • HybridGmmFunction: An alternative autograd function for hybrid grouped matrix multiplication that processes per-expert input/weight lists without requiring a cumulative group_list. Used specifically for Qwen3 MoE architectures where experts have individual weight matrices.
  • NpuMoeFused: A container class with static methods implementing NPU-fused forward passes for standard MoE architectures:
    • npu_moe_experts_forward: Replaces the experts' forward using npu_moe_token_permute, GmmFunction-based grouped matmul with SwiGLU activation, and npu_moe_token_unpermute.
    • npu_moe_sparse_block_forward: Replaces the sparse MoE block forward with gate computation, top-k routing, and fused expert execution.
  • Qwen3NpuMoeFused: Specialized container for Qwen3 MoE architectures using HybridGmmFunction for per-expert computation with SiLU activation.
  • NpuFusedMoEKernel: The registered kernel class (kernel_id: "npu_fused_moe") that applies MoE patches based on model architecture. Uses a kernel_moe_mapping configuration to map model architecture names to their corresponding patched forward functions.
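The three stages that npu_moe_experts_forward is described as fusing (token permute, grouped expert computation, token unpermute) can be illustrated in plain Python. This is a conceptual sketch only: it uses scalar tokens and hypothetical stand-in experts, calls none of the torch_npu operators, and should not be read as the module's implementation.

```python
def moe_forward_reference(tokens, topk_ids, topk_weights, experts):
    """Permute tokens by expert, run each expert on its slice, then unpermute.

    tokens:       per-token values (plain floats here, for brevity)
    topk_ids:     per-token list of selected expert indices
    topk_weights: per-token routing weights, aligned with topk_ids
    experts:      one callable per expert (hypothetical stand-ins for expert MLPs)
    """
    # "Permute": flatten (expert, token, slot) triples and sort by expert id so
    # that every expert sees one contiguous slice of the permuted batch.
    pairs = [(e, t, k) for t, ids in enumerate(topk_ids) for k, e in enumerate(ids)]
    pairs.sort(key=lambda p: p[0])

    # Rows per expert: the kind of group information a grouped matmul consumes.
    group_sizes = [0] * len(experts)
    for e, _, _ in pairs:
        group_sizes[e] += 1

    # Expert computation on the permuted layout.
    permuted_out = [experts[e](tokens[t]) for e, t, _ in pairs]

    # "Unpermute": scatter results back to token order, scaled by routing weight.
    out = [0.0] * len(tokens)
    for (e, t, k), y in zip(pairs, permuted_out):
        out[t] += topk_weights[t][k] * y
    return out, group_sizes


# Three toy experts and three tokens, each routed to two experts.
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]
tokens = [1.0, 2.0, 3.0]
topk_ids = [[0, 1], [2, 0], [1, 2]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

out, group_sizes = moe_forward_reference(tokens, topk_ids, topk_weights, experts)
```

Sorting by expert is what makes a single grouped operation possible: each expert's rows become contiguous, so the per-expert Python loop in this sketch corresponds to one grouped call on the device.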

The module supports Qwen3MoE, Qwen3VLMoE, and similar MoE architectures, with version-specific handling for different releases of the transformers library.

Usage

This kernel is automatically discovered and registered by the kernel interface's scan_all_kernels function. It is applied when running MoE models on NPU hardware with kernels enabled. No manual invocation is needed in typical usage.
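As a rough illustration of the register-then-dispatch pattern described here, the following sketch uses only the standard library. Every name in it (KERNEL_REGISTRY, DemoFusedMoEKernel, DemoSparseMoeBlock, the mapping layout, and the class-level patching) is a hypothetical assumption made for illustration; the real registry, scan_all_kernels, and the structure of kernel_moe_mapping may differ.

```python
from types import SimpleNamespace

KERNEL_REGISTRY = {}  # hypothetical stand-in for the kernel interface's registry


def register_kernel(cls):
    """Record a kernel class under its _kernel_id and return it unchanged."""
    KERNEL_REGISTRY[cls._kernel_id] = cls
    return cls


def fused_block_forward(self, hidden_states):
    # Placeholder for a patched forward; a real one would call NPU operators.
    return hidden_states


@register_kernel
class DemoFusedMoEKernel:
    _kernel_id = "demo_fused_moe"
    # Assumed layout: architecture name -> {module class name -> replacement forward}
    kernel_moe_mapping = {
        "DemoMoeForCausalLM": {"DemoSparseMoeBlock": fused_block_forward},
    }

    @classmethod
    def apply(cls, **kwargs):
        model = kwargs["model"]
        for arch in getattr(model.config, "architectures", None) or []:
            patches = cls.kernel_moe_mapping.get(arch, {})
            for module in model.modules():
                replacement = patches.get(type(module).__name__)
                if replacement is not None:
                    # Assumption: patch at class level so all instances share it.
                    type(module).forward = replacement
        return model


class DemoSparseMoeBlock:
    def forward(self, hidden_states):
        return "original"


class DemoModel:
    def __init__(self):
        self.config = SimpleNamespace(architectures=["DemoMoeForCausalLM"])
        self.block = DemoSparseMoeBlock()

    def modules(self):
        return [self, self.block]


def apply_kernel(kernel_id, **kwargs):
    """Look the kernel up by ID and delegate to its apply classmethod."""
    return KERNEL_REGISTRY[kernel_id].apply(**kwargs)


model = apply_kernel("demo_fused_moe", model=DemoModel())
result = model.block.forward("h")  # now served by fused_block_forward
```

Keying the mapping on the architecture name means a model whose architecture is not listed passes through unchanged, which is one plausible reason no manual invocation is needed.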

Code Reference

Source Location

Signature

class GmmFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, group_list) -> Tensor
    @staticmethod
    def backward(ctx, grad_output) -> tuple

class HybridGmmFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, num_experts, *args) -> tuple
    @staticmethod
    def backward(ctx, *grad_outputs) -> tuple

class NpuMoeFused:
    @staticmethod
    def npu_moe_experts_forward(self, hidden_states, routing_weights, router_indices) -> torch.Tensor
    @staticmethod
    def npu_moe_sparse_block_forward(self, hidden_states) -> torch.Tensor

class Qwen3NpuMoeFused:
    @staticmethod
    def qwen3moe_sparse_moe_block_forward(self, hidden_states) -> tuple

@register_kernel
class NpuFusedMoEKernel(BaseKernel):
    _kernel_id = "npu_fused_moe"
    _device = DeviceType.NPU
    @classmethod
    def apply(cls, **kwargs) -> HFModel

Import

from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_fused_moe import NpuFusedMoEKernel

I/O Contract

Inputs

NpuFusedMoEKernel.apply

| Name | Type | Required | Description |
|------|------|----------|-------------|
| model | HFModel (via kwargs) | Yes | The HuggingFace model instance; must have a config with an architectures attribute |

GmmFunction.forward

| Name | Type | Required | Description |
|------|------|----------|-------------|
| x | torch.Tensor | Yes | Input tensor for grouped matrix multiplication |
| weight | torch.Tensor | Yes | Weight tensor for grouped matrix multiplication |
| group_list | list | Yes | List of group sizes defining the token-to-expert assignment |
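To make the semantics of a grouped matrix multiplication concrete, here is a pure-Python reference that multiplies each contiguous slice of rows by its own weight matrix. It is a sketch of the general technique, not of torch_npu.npu_grouped_matmul itself. The table above describes group_list as group sizes, while the HybridGmmFunction description implies a cumulative form, so the sketch takes per-group sizes and shows the conversion to cumulative offsets. The function names are hypothetical.

```python
from itertools import accumulate


def matmul(a, b):
    """Plain (m x k) @ (k x n) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_matmul_reference(x, weights, group_sizes):
    """Multiply each contiguous slice of rows in x by its own weight matrix.

    x:           (total_tokens x k) rows, already sorted by expert
    weights:     one (k x n) matrix per expert
    group_sizes: number of rows belonging to each expert, in order
    """
    assert sum(group_sizes) == len(x), "group sizes must cover every row"
    out, start = [], 0
    for w, size in zip(weights, group_sizes):
        out.extend(matmul(x[start:start + size], w))  # empty slice -> no rows
        start += size
    return out


def to_cumulative(group_sizes):
    """Convert per-group sizes to cumulative end offsets."""
    return list(accumulate(group_sizes))


x = [[1, 0], [0, 1], [1, 1]]
weights = [
    [[1, 2], [3, 4]],      # expert 0
    [[10, 20], [30, 40]],  # expert 1
]

y = grouped_matmul_reference(x, weights, [2, 1])
y_skewed = grouped_matmul_reference(x, weights, [0, 3])  # expert 0 gets no tokens
offsets = to_cumulative([2, 1])
```

An expert that receives no tokens simply contributes an empty slice, which is why the grouped form handles unbalanced routing without special cases.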

NpuMoeFused.npu_moe_experts_forward

| Name | Type | Required | Description |
|------|------|----------|-------------|
| hidden_states | torch.Tensor | Yes | Input hidden states from the transformer layer |
| routing_weights | torch.Tensor | Yes | Softmax-normalized routing weights from the gate |
| router_indices | torch.Tensor | Yes | Top-k expert indices for each token |
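The routing_weights and router_indices inputs are the result of gating followed by top-k selection. A minimal standard-library sketch of that step follows, assuming the softmax is taken over all experts before selection. Whether the selected weights are renormalized afterwards depends on the model configuration and is not shown. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (routing_weights, router_indices), one row per token."""
    weights, indices = [], []
    for logits in router_logits:
        probs = softmax(logits)
        # Indices of the k largest probabilities, highest first.
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        weights.append([probs[i] for i in top])
        indices.append(top)
    return weights, indices


# Two tokens, four experts, top-2 routing.
router_logits = [[2.0, 0.0, 1.0, -1.0], [0.0, 3.0, 0.5, 2.0]]
routing_weights, router_indices = route_top_k(router_logits, k=2)
```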

Outputs

NpuFusedMoEKernel.apply

| Name | Type | Description |
|------|------|-------------|
| model | HFModel | The model with MoE forward methods monkey-patched to use NPU fused operations |

NpuMoeFused.npu_moe_experts_forward

| Name | Type | Description |
|------|------|-------------|
| next_states | torch.Tensor | Output tensor after expert computation and token unpermutation, shape (batch_size, seq_len, hidden_size) |

Usage Examples

# Application by kernel ID through the kernel interface (`model` is an already-loaded HFModel)
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel

apply_kernel("npu_fused_moe", model=model)

# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_fused_moe import NpuFusedMoEKernel

NpuFusedMoEKernel.apply(model=model)
