Overview
An NPU-optimized fused Mixture-of-Experts (MoE) kernel that replaces the standard MoE forward pass with grouped matrix multiplication operations native to Huawei Ascend NPUs, with the aim of improving throughput on that hardware.
Description
npu_fused_moe.py implements hardware-accelerated MoE computation for Huawei NPU (Ascend) devices. The module provides:
- GmmFunction: A custom PyTorch autograd function implementing grouped matrix multiplication (GMM) using torch_npu.npu_grouped_matmul. Supports forward and backward passes with proper gradient computation for training.
- HybridGmmFunction: An alternative autograd function for hybrid grouped matrix multiplication that processes per-expert input/weight lists without requiring a cumulative group_list. Used specifically for Qwen3 MoE architectures where experts have individual weight matrices.
- NpuMoeFused: A container class with static methods implementing NPU-fused forward passes for standard MoE architectures:
  - npu_moe_experts_forward: Replaces the experts' forward using npu_moe_token_permute, GmmFunction-based grouped matmul with SwiGLU activation, and npu_moe_token_unpermute.
  - npu_moe_sparse_block_forward: Replaces the sparse MoE block forward with gate computation, top-k routing, and fused expert execution.
- Qwen3NpuMoeFused: Specialized container for Qwen3 MoE architectures using HybridGmmFunction for per-expert computation with SiLU activation.
- NpuFusedMoEKernel: The registered kernel class (kernel_id: "npu_fused_moe") that applies MoE patches based on model architecture. Uses a kernel_moe_mapping configuration to map model architecture names to their corresponding patched forward functions.
The module supports Qwen3MoE, Qwen3VLMoE, and similar MoE architectures, and adapts its patching to the installed transformers version.
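The permute, grouped matmul with SwiGLU, and unpermute steps listed above can be sketched in plain PyTorch. This is a CPU reference under stated assumptions, not the module's implementation: the function name `reference_moe_experts` and the weight layouts (`gate_up`, `down`) are hypothetical, and a Python loop over `torch.split` chunks stands in for the NPU operators.

```python
import torch
import torch.nn.functional as F


def reference_moe_experts(hidden, routing_weights, router_indices, gate_up, down):
    """Pure-PyTorch stand-in for permute -> grouped matmul + SwiGLU -> unpermute.

    hidden:          (num_tokens, hidden_size)
    routing_weights: (num_tokens, top_k)
    router_indices:  (num_tokens, top_k), values in [0, num_experts)
    gate_up:         (num_experts, hidden_size, 2 * intermediate_size)  # assumed layout
    down:            (num_experts, intermediate_size, hidden_size)      # assumed layout
    """
    top_k = router_indices.shape[1]
    num_experts = gate_up.shape[0]

    # "Permute": sort (token, slot) pairs so rows for the same expert are contiguous.
    flat_experts = router_indices.reshape(-1)
    order = torch.argsort(flat_experts)
    token_ids = torch.div(order, top_k, rounding_mode="floor")
    permuted = hidden[token_ids]

    # Rows per expert; this plays the role of the group description.
    counts = torch.bincount(flat_experts, minlength=num_experts).tolist()

    # "Grouped matmul": each contiguous chunk uses its own expert weights.
    outputs = []
    for e, chunk in enumerate(torch.split(permuted, counts)):
        gate, up = (chunk @ gate_up[e]).chunk(2, dim=-1)
        outputs.append((F.silu(gate) * up) @ down[e])  # SwiGLU
    expert_out = torch.cat(outputs)

    # "Unpermute": scale by the routing weight and accumulate back in token order.
    weights = routing_weights.reshape(-1)[order].unsqueeze(-1)
    result = torch.zeros_like(hidden)
    result.index_add_(0, token_ids, expert_out * weights)
    return result
```

Sorting tokens by expert is what makes a single grouped matmul possible: after the permutation, each expert's rows form one contiguous slice, so the per-expert loop here corresponds to one fused call on the device.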
Usage
This kernel is automatically discovered and registered by the kernel interface's scan_all_kernels function. It is applied when running MoE models on NPU hardware with kernels enabled. No manual invocation is needed in typical usage.
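The registration and architecture-based patching described above can be illustrated with a toy model. Everything in this sketch is hypothetical: `KERNEL_REGISTRY`, this `register_kernel`, the `Toy*` classes, and the shape of `kernel_moe_mapping` are stand-ins chosen for illustration, and the real kernel may patch classes rather than instances.

```python
import types

KERNEL_REGISTRY = {}  # hypothetical: kernel_id -> kernel class


def register_kernel(cls):
    """Hypothetical decorator: index the class by its _kernel_id."""
    KERNEL_REGISTRY[cls._kernel_id] = cls
    return cls


def fused_block_forward(self, hidden_states):
    """Placeholder for a fused forward; tags the output so the patch is visible."""
    return ("fused", hidden_states)


class ToySparseMoeBlock:
    def forward(self, hidden_states):
        return ("eager", hidden_states)


class ToyConfig:
    architectures = ["ToyMoeForCausalLM"]


class ToyModel:
    def __init__(self):
        self.config = ToyConfig()
        self.blocks = [ToySparseMoeBlock(), ToySparseMoeBlock()]

    def modules(self):
        return iter(self.blocks)


@register_kernel
class ToyFusedMoEKernel:
    _kernel_id = "toy_fused_moe"
    # architecture name -> {module class name -> replacement forward}
    kernel_moe_mapping = {
        "ToyMoeForCausalLM": {"ToySparseMoeBlock": fused_block_forward},
    }

    @classmethod
    def apply(cls, **kwargs):
        model = kwargs["model"]
        arch = model.config.architectures[0]
        patches = cls.kernel_moe_mapping.get(arch)
        if patches is None:
            return model  # unknown architecture: leave the model untouched
        for module in model.modules():
            new_forward = patches.get(type(module).__name__)
            if new_forward is not None:
                # Bind per instance so other models of the same class are unaffected.
                module.forward = types.MethodType(new_forward, module)
        return model
```

Keying the mapping on the architecture name keeps the kernel a no-op for models it does not recognize, which is why the model's config needs an architectures attribute.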
Code Reference
Source Location
Signature
class GmmFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, group_list) -> Tensor

    @staticmethod
    def backward(ctx, grad_output) -> tuple

class HybridGmmFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, num_experts, *args) -> tuple

    @staticmethod
    def backward(ctx, *grad_outputs) -> tuple

class NpuMoeFused:
    @staticmethod
    def npu_moe_experts_forward(self, hidden_states, routing_weights, router_indices) -> torch.Tensor

    @staticmethod
    def npu_moe_sparse_block_forward(self, hidden_states) -> torch.Tensor

class Qwen3NpuMoeFused:
    @staticmethod
    def qwen3moe_sparse_moe_block_forward(self, hidden_states) -> tuple

@register_kernel
class NpuFusedMoEKernel(BaseKernel):
    _kernel_id = "npu_fused_moe"
    _device = DeviceType.NPU

    @classmethod
    def apply(cls, **kwargs) -> HFModel
Import
from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_fused_moe import NpuFusedMoEKernel
I/O Contract
Inputs
NpuFusedMoEKernel.apply
| Name | Type | Required | Description |
|---|---|---|---|
| model | HFModel (via kwargs) | Yes | The HuggingFace model instance; must have a config with architectures attribute |
GmmFunction.forward
| Name | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Input tensor for grouped matrix multiplication |
| weight | torch.Tensor | Yes | Weight tensor for grouped matrix multiplication |
| group_list | list | Yes | List of group sizes defining the token-to-expert assignment |
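To make the forward and backward contract of a grouped matmul concrete, here is a minimal CPU sketch of a custom autograd function with the same three inputs. It is an illustration under assumptions, not the module's code: `ReferenceGmm` is a hypothetical name, a Python loop replaces the NPU operator, and `group_list` is taken to hold per-expert row counts (the real operator may instead expect cumulative offsets, i.e. the running sum of these counts).

```python
import torch


class ReferenceGmm(torch.autograd.Function):
    """CPU stand-in for a grouped matmul; group_list holds per-expert row counts (assumed)."""

    @staticmethod
    def forward(ctx, x, weight, group_list):
        # x: (total_rows, in_features); weight: (num_experts, in_features, out_features)
        ctx.save_for_backward(x, weight)
        ctx.group_list = list(group_list)
        chunks = torch.split(x, ctx.group_list)
        return torch.cat([c @ weight[e] for e, c in enumerate(chunks)])

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        x_chunks = torch.split(x, ctx.group_list)
        g_chunks = torch.split(grad_output, ctx.group_list)
        # dL/dx: each group's output gradient times its expert weight, transposed.
        grad_x = torch.cat([g @ weight[e].T for e, g in enumerate(g_chunks)])
        # dL/dW: x^T @ grad for each group; an empty group yields a zero matrix.
        grad_w = torch.stack([xc.T @ g for xc, g in zip(x_chunks, g_chunks)])
        # group_list is not differentiable, so its gradient slot is None.
        return grad_x, grad_w, None
```

The backward pass is itself two grouped matmuls over the same groups, which is why the group description has to be kept on `ctx` alongside the saved tensors.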
NpuMoeFused.npu_moe_experts_forward
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_states | torch.Tensor | Yes | Input hidden states from the transformer layer |
| routing_weights | torch.Tensor | Yes | Softmax-normalized routing weights from the gate |
| router_indices | torch.Tensor | Yes | Top-k expert indices for each token |
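For orientation, the two routing inputs above are typically produced by a gate projection followed by softmax and top-k selection. The sketch below assumes that conventional recipe; `route_tokens` and its `renormalize` flag are hypothetical names, and whether the selected weights are rescaled to sum to 1 varies by model.

```python
import torch


def route_tokens(hidden_states, gate_weight, top_k, renormalize=True):
    """Gate -> softmax -> top-k. Returns (routing_weights, router_indices).

    hidden_states: (num_tokens, hidden_size)
    gate_weight:   (num_experts, hidden_size), laid out like a bias-free nn.Linear weight
    """
    router_logits = hidden_states @ gate_weight.T
    # Softmax in float32 for stability, then cast back at the end.
    probs = torch.softmax(router_logits.float(), dim=-1)
    routing_weights, router_indices = torch.topk(probs, top_k, dim=-1)
    if renormalize:
        # Rescale so each token's selected weights sum to 1 (model-dependent choice).
        routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
    return routing_weights.to(hidden_states.dtype), router_indices
```

Both outputs have shape (num_tokens, top_k), so hidden states shaped (batch_size, seq_len, hidden_size) would be flattened to two dimensions before routing in this sketch.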
Outputs
NpuFusedMoEKernel.apply
| Name | Type | Description |
|---|---|---|
| model | HFModel | The model with MoE forward methods monkey-patched to use NPU fused operations |
NpuMoeFused.npu_moe_experts_forward
| Name | Type | Description |
|---|---|---|
| next_states | torch.Tensor | Output tensor after expert computation and token unpermutation, shape (batch_size, seq_len, hidden_size) |
Usage Examples
# Automatic application via kernel interface
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel
apply_kernel("npu_fused_moe", model=model)
# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_fused_moe import NpuFusedMoEKernel
NpuFusedMoEKernel.apply(model=model)