Implementation:Hiyouga LLaMA Factory NPU RoPE
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Hardware Acceleration, NPU |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
NPU-optimized Rotary Position Embedding (RoPE) kernel that replaces the standard rotate_half-based implementation with Huawei NPU-native torch_npu.npu_rotary_mul for accelerated positional encoding in every transformer attention layer.
Description
npu_rope.py implements a hardware-accelerated replacement for the Rotary Position Embedding (RoPE) computation used in transformer attention layers. RoPE is applied to the query and key tensors of every attention layer on every forward pass, so it is a high-frequency operation and a natural target for a fused hardware kernel.
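For orientation, the rotate_half formulation that the kernel replaces can be written out for a single head vector. This is a minimal plain-Python sketch, not code from npu_rope.py: the helper names rope_angles and apply_rope_reference, and the base of 10000, are assumptions chosen for illustration.

```python
import math


def rotate_half(x):
    """Return [-x2, x1], where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_angles(position, head_dim, base=10000.0):
    """cos/sin values for one position, duplicated across both halves of the head."""
    half = head_dim // 2
    angles = [position / (base ** (2 * i / head_dim)) for i in range(half)]
    angles = angles + angles  # same layout as concatenating the frequencies twice
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def apply_rope_reference(x, cos, sin):
    """Unfused reference: x * cos + rotate_half(x) * sin, element by element."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rotated, sin)]


q = [1.0, 2.0, 3.0, 4.0]
cos, sin = rope_angles(position=3, head_dim=4)
q_embed = apply_rope_reference(q, cos, sin)
```

The reference builds the rotate_half intermediate explicitly; a fused kernel produces the same rotation without that extra tensor. Because the operation is a rotation, the vector norm is unchanged, which makes a convenient sanity check when comparing implementations.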
The module provides:
- _apply_rotary_pos_emb: The NPU-accelerated standard RoPE function that replaces the default rotate_half implementation. It passes the query and key tensors to torch_npu.npu_rotary_mul, which computes each rotary embedding as a fused operation and avoids materializing the intermediate rotate_half result.
- _apply_multimodal_rotary_pos_emb_qwen25_vl: A specialized variant for Qwen2.5-VL multimodal models that handles the multimodal RoPE section splitting (mrope_section) before applying npu_rotary_mul. This handles the 3D positional encoding (temporal, height, width) used in vision-language models.
- NpuRoPEKernel: The registered kernel class (kernel_id: "npu_fused_rope") that applies the optimization by:
  - Iterating over all model modules to find attention layers (classes with "Attention" in their name).
  - Identifying the Python module where each attention class is defined.
  - Monkey-patching the apply_rotary_pos_emb function (and apply_multimodal_rotary_pos_emb for VL models) at the module level using setattr on sys.modules entries.
  - Deduplicating patches to avoid redundant replacements for shared module definitions.
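The patching steps listed above can be sketched with the standard library alone. Everything named here is hypothetical: toy_modeling stands in for a real modeling module, fused_rope for the NPU-backed function, and patch_rope for the kernel's apply logic. The real kernel walks the model's submodules rather than a plain list.

```python
import sys
import types

# Hypothetical stand-in for a modeling module that defines attention classes.
SOURCE = '''
def apply_rotary_pos_emb(q, k):
    return ("default", q, k)

class ToyAttention:
    def forward(self, q, k):
        # The name is resolved in the module globals at call time,
        # so a later setattr on the module is picked up here.
        return apply_rotary_pos_emb(q, k)

class ToyMLP:
    pass
'''

toy_modeling = types.ModuleType("toy_modeling")
exec(SOURCE, toy_modeling.__dict__)
sys.modules["toy_modeling"] = toy_modeling


def fused_rope(q, k):
    """Stand-in for the accelerated replacement function."""
    return ("fused", q, k)


def patch_rope(layers, replacement):
    """Patch apply_rotary_pos_emb once per defining module; return the patched names."""
    patched = set()
    for layer in layers:
        cls = type(layer)
        if "Attention" not in cls.__name__:
            continue  # only attention classes are considered
        module_name = cls.__module__
        if module_name in patched:
            continue  # this module definition was already handled
        module = sys.modules.get(module_name)
        if module is not None and hasattr(module, "apply_rotary_pos_emb"):
            setattr(module, "apply_rotary_pos_emb", replacement)
            patched.add(module_name)
    return patched


layers = [toy_modeling.ToyAttention(), toy_modeling.ToyMLP(), toy_modeling.ToyAttention()]
patched = patch_rope(layers, fused_rope)
result = layers[0].forward(1, 2)
```

Because forward looks the function up by name in its module at call time, one setattr changes the behavior of every attention instance defined in that module, including instances created before the patch. Code that imported the function into a different namespace with a from-import would keep the original binding.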
Usage
This kernel is automatically discovered and registered by the kernel interface. It is applied when running a transformer model with RoPE on NPU hardware, provided the model's attention classes have "Attention" in their name and their defining module exposes apply_rotary_pos_emb. The patch operates at the module level (not instance level), so all attention layers using the same module definition are optimized simultaneously.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/v1/plugins/model_plugins/kernels/ops/rope/npu_rope.py
- Lines: 1-149
Signature
def _apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1) -> tuple[Tensor, Tensor]
def _apply_multimodal_rotary_pos_emb_qwen25_vl(q, k, cos, sin, mrope_section, unsqueeze_dim=1) -> tuple[Tensor, Tensor]
@register_kernel
class NpuRoPEKernel(BaseKernel):
    _kernel_id = "npu_fused_rope"
    _device = DeviceType.NPU
    @classmethod
    def apply(cls, **kwargs) -> HFModel
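The mrope_section parameter in the multimodal signature controls how the temporal, height, and width components are interleaved along the head dimension. The sketch below follows the layout used in the Hugging Face Transformers reference implementation for Qwen2-VL (section list doubled, chunk i taken from axis i mod 3). Whether npu_rope.py matches it line for line is an assumption, and merge_mrope_sections is a hypothetical helper operating on plain lists rather than tensors.

```python
def merge_mrope_sections(per_axis, mrope_section):
    """Interleave per-axis values into one head_dim-long list.

    per_axis: three equal-length lists (temporal, height, width) for one token.
    mrope_section: section widths covering half of the head dimension.
    """
    sections = list(mrope_section) * 2  # both halves of the head share the layout
    head_dim = len(per_axis[0])
    if sum(sections) != head_dim:
        raise ValueError("doubled mrope_section must sum to the head dimension")

    merged = []
    start = 0
    for i, width in enumerate(sections):
        # Chunk i is read from axis i % 3: temporal, height, width, temporal, ...
        merged.extend(per_axis[i % 3][start:start + width])
        start += width
    return merged


# Labels instead of numbers make the resulting layout visible.
temporal, height, width = ["t"] * 8, ["h"] * 8, ["w"] * 8
layout = merge_mrope_sections([temporal, height, width], mrope_section=[2, 1, 1])
```

After this merge, cos and sin have the same shape as in the text-only case, so the same fused rotation can be applied to q and k.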
Import
from llamafactory.v1.plugins.model_plugins.kernels.ops.rope.npu_rope import NpuRoPEKernel
I/O Contract
Inputs
NpuRoPEKernel.apply
| Name | Type | Required | Description |
|---|---|---|---|
| model | HFModel (via kwargs) | Yes | The HuggingFace model instance; its attention modules will be inspected for RoPE patching |
_apply_rotary_pos_emb
| Name | Type | Required | Description |
|---|---|---|---|
| q | torch.Tensor | Yes | Query tensor from the attention layer |
| k | torch.Tensor | Yes | Key tensor from the attention layer |
| cos | torch.Tensor | Yes | Cosine component of the rotary embedding |
| sin | torch.Tensor | Yes | Sine component of the rotary embedding |
| position_ids | torch.Tensor | No | Position IDs (unused in NPU implementation, kept for API compatibility) |
| unsqueeze_dim | int | No | Dimension to unsqueeze cos/sin tensors (default: 1) |
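The role of unsqueeze_dim is easiest to see at the level of shapes. This is a small shape-only sketch using plain tuples. The helper names are hypothetical, and the example shapes assume cos and sin arrive as (batch, seq_len, head_dim), which is the common Hugging Face convention and not something the table above states.

```python
def unsqueeze_shape(shape, dim):
    """Shape after inserting a size-1 axis at position dim."""
    shape = list(shape)
    if dim < 0:
        dim += len(shape) + 1
    shape.insert(dim, 1)
    return tuple(shape)


def broadcastable(a, b):
    """True if shapes a and b satisfy the usual trailing-dimension broadcasting rule."""
    return all(x == y or x == 1 or y == 1 for x, y in zip(reversed(a), reversed(b)))


cos_shape = (2, 16, 64)          # (batch, seq_len, head_dim), assumed layout
q_heads_first = (2, 8, 16, 64)   # (batch, heads, seq_len, head_dim)
q_seq_first = (2, 16, 8, 64)     # (batch, seq_len, heads, head_dim)

ok_default = broadcastable(unsqueeze_shape(cos_shape, 1), q_heads_first)
ok_seq_first = broadcastable(unsqueeze_shape(cos_shape, 2), q_seq_first)
ok_without = broadcastable(cos_shape, q_heads_first)
```

With the default of 1, the inserted axis lines up with the heads dimension of a heads-first query tensor. A value of 2 plays the same role when the sequence dimension comes before the heads dimension.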
Outputs
NpuRoPEKernel.apply
| Name | Type | Description |
|---|---|---|
| model | HFModel | The model with apply_rotary_pos_emb patched to use NPU-native npu_rotary_mul |
_apply_rotary_pos_emb
| Name | Type | Description |
|---|---|---|
| q_embed | torch.Tensor | Query tensor with rotary position embedding applied |
| k_embed | torch.Tensor | Key tensor with rotary position embedding applied |
Usage Examples
# Automatic application via kernel interface
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel
apply_kernel("npu_fused_rope", model=model)
# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.rope.npu_rope import NpuRoPEKernel
NpuRoPEKernel.apply(model=model)
Related Pages
- Hiyouga_LLaMA_Factory_Kernel_Interface - Kernel discovery and registration interface that manages this kernel
- Hiyouga_LLaMA_Factory_NPU_SwiGLU - Related NPU SwiGLU kernel for MLP layers
- Hiyouga_LLaMA_Factory_NPU_Fused_MoE - Related NPU MoE kernel for expert layers