Implementation:Hiyouga LLaMA Factory NPU RoPE

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Hardware Acceleration, NPU
Last Updated	2026-02-06 19:00 GMT

Overview

NPU-optimized Rotary Position Embedding (RoPE) kernel that replaces the standard rotate_half-based implementation with Huawei NPU-native torch_npu.npu_rotary_mul for accelerated positional encoding in every transformer attention layer.

Description

npu_rope.py implements a hardware-accelerated replacement for the Rotary Position Embedding computation used in transformer attention layers. RoPE is applied to the query and key tensors in every attention layer on every forward pass, so it is executed very frequently and is a natural target for a fused hardware operator.
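For reference, the rotate_half formulation that the NPU kernel replaces can be sketched in plain Python on a single head vector. This is an illustrative sketch, not code from npu_rope.py: the helper names (rotate_half, apply_rope_reference, rope_angles), the base of 10000, and the use of Python lists instead of tensors are assumptions chosen to keep the example dependency-free.

```python
import math


def rotate_half(x):
    """Return [-x2, x1], where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rope_reference(x, cos, sin):
    """Element-wise x * cos + rotate_half(x) * sin for one head vector."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


def rope_angles(position, head_dim, base=10000.0):
    """cos/sin tables for one position, with the angles duplicated across both halves."""
    half = head_dim // 2
    freqs = [position / (base ** (2 * i / head_dim)) for i in range(half)]
    angles = freqs + freqs
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


q = [1.0, 2.0, 3.0, 4.0]
cos, sin = rope_angles(position=3, head_dim=4)
q_embed = apply_rope_reference(q, cos, sin)
```

Each pair (x[i], x[i + half]) is rotated by its own angle, so the vector norm is preserved and position 0 leaves the vector unchanged. A fused operator computes the same result without building the rotate_half intermediate as a separate step.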

The module provides:

_apply_rotary_pos_emb: The NPU-accelerated standard RoPE function that replaces the default rotate_half implementation. It applies torch_npu.npu_rotary_mul, a fused NPU operator, to the query and key tensors, avoiding the explicit rotate_half computation.
_apply_multimodal_rotary_pos_emb_qwen25_vl: A specialized variant for Qwen2.5-VL multimodal models that handles the multimodal RoPE section splitting (mrope_section) before applying npu_rotary_mul. This handles the 3D positional encoding (temporal, height, width) used in vision-language models.
NpuRoPEKernel: The registered kernel class (kernel_id: "npu_fused_rope") that applies the optimization by:
- Iterating over all model modules to find attention layers (classes with "Attention" in their name).
- Identifying the Python module where each attention class is defined.
- Monkey-patching the apply_rotary_pos_emb function (and apply_multimodal_rotary_pos_emb for VL models) at the module level using setattr on sys.modules entries.
- Deduplicating patches to avoid redundant replacements for shared module definitions.
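The patching steps listed above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in npu_rope.py: toy_modeling, ToyAttention, ToyMLP, fused_rope, and patch_rope are hypothetical names, and a plain list of objects stands in for iterating over the model's modules.

```python
import sys
import types

# Hypothetical stand-in for a modeling file: a module-level RoPE function
# plus an attention class whose forward() calls it through module globals.
SOURCE = '''
def apply_rotary_pos_emb(q, k):
    return ("reference", q, k)

class ToyAttention:
    def forward(self, q, k):
        return apply_rotary_pos_emb(q, k)

class ToyMLP:
    pass
'''

toy = types.ModuleType("toy_modeling")
exec(SOURCE, toy.__dict__)
sys.modules["toy_modeling"] = toy


def fused_rope(q, k):
    """Hypothetical replacement standing in for the NPU-accelerated function."""
    return ("fused", q, k)


def patch_rope(modules, replacement):
    """Replace apply_rotary_pos_emb in every module that defines an attention class."""
    patched = set()
    for module in modules:
        cls = type(module)
        if "Attention" not in cls.__name__:
            continue  # only attention layers are of interest
        module_name = cls.__module__
        if module_name in patched:
            continue  # this defining module was already patched
        py_module = sys.modules.get(module_name)
        if py_module is not None and hasattr(py_module, "apply_rotary_pos_emb"):
            setattr(py_module, "apply_rotary_pos_emb", replacement)
            patched.add(module_name)
    return patched


layers = [toy.ToyAttention(), toy.ToyAttention(), toy.ToyMLP()]
patched = patch_rope(layers, fused_rope)
```

Because forward() resolves apply_rotary_pos_emb through its module's globals at call time, a single setattr on the defining module changes the behavior of every attention instance created from it.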

Usage

This kernel is automatically discovered and registered by the kernel interface. It targets transformer models that use RoPE and run on NPU hardware. Because the patch replaces a module-level function rather than a method on individual instances, every attention layer defined in the same Python module picks up the optimized function at once.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/plugins/model_plugins/kernels/ops/rope/npu_rope.py
Lines: 1-149

Signature

def _apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1) -> tuple[Tensor, Tensor]

def _apply_multimodal_rotary_pos_emb_qwen25_vl(q, k, cos, sin, mrope_section, unsqueeze_dim=1) -> tuple[Tensor, Tensor]

@register_kernel
class NpuRoPEKernel(BaseKernel):
    _kernel_id = "npu_fused_rope"
    _device = DeviceType.NPU

    @classmethod
    def apply(cls, **kwargs) -> HFModel

Import

from llamafactory.v1.plugins.model_plugins.kernels.ops.rope.npu_rope import NpuRoPEKernel

I/O Contract

Inputs

NpuRoPEKernel.apply

Name	Type	Required	Description
model	HFModel (via kwargs)	Yes	The HuggingFace model instance; its attention modules will be inspected for RoPE patching

_apply_rotary_pos_emb

Name	Type	Required	Description
q	torch.Tensor	Yes	Query tensor from the attention layer
k	torch.Tensor	Yes	Key tensor from the attention layer
cos	torch.Tensor	Yes	Cosine component of the rotary embedding
sin	torch.Tensor	Yes	Sine component of the rotary embedding
position_ids	torch.Tensor	No	Position IDs (unused in NPU implementation, kept for API compatibility)
unsqueeze_dim	int	No	Dimension to unsqueeze cos/sin tensors (default: 1)

Outputs

NpuRoPEKernel.apply

Name	Type	Description
model	HFModel	The model with apply_rotary_pos_emb patched to use NPU-native npu_rotary_mul

_apply_rotary_pos_emb

Name	Type	Description
q_embed	torch.Tensor	Query tensor with rotary position embedding applied
k_embed	torch.Tensor	Key tensor with rotary position embedding applied

Usage Examples

# Automatic application via kernel interface
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel

apply_kernel("npu_fused_rope", model=model)

# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.rope.npu_rope import NpuRoPEKernel

NpuRoPEKernel.apply(model=model)
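When a script may also run on machines without NPU support, the direct call can be guarded. The following is a minimal sketch under stated assumptions: npu_runtime_available and maybe_apply_npu_rope are hypothetical helpers, and checking that the torch_npu package is installed does not by itself guarantee that an NPU device is present.

```python
import importlib.util


def npu_runtime_available():
    """Return True if the torch_npu package can be found in this environment."""
    return importlib.util.find_spec("torch_npu") is not None


def maybe_apply_npu_rope(model, apply_fn):
    """Call apply_fn(model=model) only when torch_npu is installed.

    apply_fn is expected to be something like NpuRoPEKernel.apply; it is
    passed in so this helper has no import-time dependency on torch_npu.
    """
    if npu_runtime_available():
        return apply_fn(model=model)
    return model
```

For example, maybe_apply_npu_rope(model, NpuRoPEKernel.apply) would leave the model untouched on a machine where torch_npu is not installed.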

Related Pages

Hiyouga_LLaMA_Factory_Kernel_Interface - Kernel discovery and registration interface that manages this kernel
Hiyouga_LLaMA_Factory_NPU_SwiGLU - Related NPU SwiGLU kernel for MLP layers
Hiyouga_LLaMA_Factory_NPU_Fused_MoE - Related NPU MoE kernel for expert layers
