Implementation:Hiyouga LLaMA Factory NPU SwiGLU
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Hardware Acceleration, NPU |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
NPU-optimized SwiGLU activation kernel for MLP layers across 20+ supported model architectures. It replaces the SiLU activation and the elementwise gating multiply with a single NPU-native operation (torch_npu.npu_swiglu), applied to the concatenated outputs of the gate and up projections.
Description
npu_swiglu.py implements a hardware-accelerated SwiGLU activation replacement for Huawei NPU (Ascend) devices. The SwiGLU activation (SiLU(gate) * up) is a common pattern in modern LLMs like LLaMA, Qwen, and Gemma. This module fuses the operation using torch_npu.npu_swiglu for improved throughput.
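The arithmetic behind SiLU(gate) * up can be sketched in plain Python. This is a reference sketch only, not the NPU kernel: it assumes the fused operation splits its input into two equal halves along the last dimension and returns SiLU(first half) * second half. Python lists stand in for tensors, and `swiglu_reference`, `gate_out`, and `up_out` are hypothetical names introduced for illustration.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_reference(fused: List[float]) -> List[float]:
    """Split `fused` into two halves and return silu(first) * second.

    Assumption: this mirrors the semantics of the fused NPU op on a
    single vector; the real op works on tensors along the last dim.
    """
    if len(fused) % 2 != 0:
        raise ValueError("fused input must have an even length")
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]
    return [silu(g) * u for g, u in zip(gate, up)]


# Stand-ins for gate_proj(x) and up_proj(x) on one token.
gate_out = [0.0, 1.0, -2.0]
up_out = [3.0, 2.0, 0.5]

# Concatenate gate first, then up, matching cat(gate_proj(x), up_proj(x)).
activated = swiglu_reference(gate_out + up_out)
```

In the real forward pass the result would then go through down_proj. The order of concatenation matters, since the first half is the one that receives the SiLU activation.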
The module provides:
- npu_swiglu_forward: The default SwiGLU forward function that concatenates gate_proj and up_proj outputs, applies npu_swiglu, then passes through down_proj. Works for most architectures (LLaMA, Qwen2, Qwen3, DeepSeek, etc.).
- _npu_swiglu_glm4_forward: Specialized variant for GLM4 and Phi3 architectures that use a fused gate_up_proj with chunk-based splitting.
- _npu_swiglu_gemma3ntext_forward: Specialized variant for Gemma3nText that supports activation sparsity via gaussian_topk before the SwiGLU operation.
- NpuSwiGluKernel: The registered kernel class (kernel_id: "npu_fused_swiglu") with:
  - expect_modules: A frozenset of 21 supported MLP module class names including Qwen3MLP, LlamaMLP, Glm4MLP, Gemma3MLP, DeepseekV3MLP, and others.
  - apply: Iterates over model modules, matches MLP layers by class name against expect_modules, and monkey-patches their forward methods with the appropriate kernel function using types.MethodType.
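The class-name matching and monkey-patching described for apply can be illustrated with a small, dependency-free sketch. The classes below are plain Python stand-ins rather than real torch.nn.Module subclasses, and `apply_patch`, `fused_forward`, and the two-entry `EXPECT_MODULES` set are hypothetical names chosen for the example.

```python
import types

# Subset of class names for illustration; the real set has 21 entries.
EXPECT_MODULES = frozenset({"LlamaMLP", "Qwen3MLP"})


class LlamaMLP:
    """Stand-in for an MLP module whose class name is in EXPECT_MODULES."""

    def forward(self, hidden_state):
        return f"eager:{hidden_state}"


class LayerNorm:
    """Stand-in for a module that must be left untouched."""

    def forward(self, hidden_state):
        return f"norm:{hidden_state}"


def fused_forward(self, hidden_state):
    """Replacement forward; `self` is the patched module instance."""
    return f"fused:{hidden_state}"


def apply_patch(modules):
    """Bind fused_forward onto every module whose class name matches."""
    for module in modules:
        if type(module).__name__ in EXPECT_MODULES:
            # MethodType binds the function to this instance only,
            # so the class and other instances keep the original forward.
            module.forward = types.MethodType(fused_forward, module)
    return modules


mlp, norm = apply_patch([LlamaMLP(), LayerNorm()])
```

Because the patch is set as an instance attribute, it shadows the class-level forward only on the matched objects. Matching by class name rather than by type means the patching code does not need to import every model's MLP class.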
Usage
This kernel is automatically discovered and registered by the kernel interface. It is applied when running SwiGLU-based models (LLaMA, Qwen, Gemma, GLM4, DeepSeek, etc.) on NPU hardware. The kernel is only applied to MLP modules whose class names are in the expect_modules set.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/v1/plugins/model_plugins/kernels/ops/mlp/npu_swiglu.py
- Lines: 1-168
Signature
def npu_swiglu_forward(self, hidden_state) -> torch.Tensor

@register_kernel
class NpuSwiGluKernel(BaseKernel):
    expect_modules = frozenset({
        "Qwen3MLP", "Qwen2MLP", "LlamaMLP", "Glm4MLP",
        "Gemma3MLP", "DeepseekV3MLP", "Phi3MLP", ...
    })
    _kernel_id = "npu_fused_swiglu"
    _device = DeviceType.NPU

    @classmethod
    def apply(cls, **kwargs) -> HFModel
Import
from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_swiglu import NpuSwiGluKernel
I/O Contract
Inputs
NpuSwiGluKernel.apply
| Name | Type | Required | Description |
|---|---|---|---|
| model | HFModel (via kwargs) | Yes | The HuggingFace model instance containing MLP modules to patch |
npu_swiglu_forward
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_state | torch.Tensor | Yes | Input hidden state tensor passed to the MLP block (the patched module's forward input) |
Outputs
NpuSwiGluKernel.apply
| Name | Type | Description |
|---|---|---|
| model | HFModel | The model with MLP forward methods monkey-patched to use NPU fused SwiGLU |
npu_swiglu_forward
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Output of down_proj(npu_swiglu(cat(gate_proj(x), up_proj(x)))) |
Usage Examples
# Automatic application via kernel interface
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel
apply_kernel("npu_fused_swiglu", model=model)
# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.mlp.npu_swiglu import NpuSwiGluKernel
NpuSwiGluKernel.apply(model=model)
Related Pages
- Hiyouga_LLaMA_Factory_Kernel_Interface - Kernel discovery and registration interface that manages this kernel
- Hiyouga_LLaMA_Factory_NPU_Fused_MoE - Related NPU MoE kernel for MoE-based MLP layers
- Hiyouga_LLaMA_Factory_NPU_RoPE - Related NPU RoPE kernel for attention layers