Implementation:Hiyouga LLaMA Factory NPU RoPE

From Leeroopedia
Revision as of 15:06, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hiyouga_LLaMA_Factory_NPU_RoPE.md)

Knowledge Sources
Domains: Machine Learning, Hardware Acceleration, NPU
Last Updated: 2026-02-06 19:00 GMT

Overview

NPU-optimized Rotary Position Embedding (RoPE) kernel that replaces the standard rotate_half-based implementation with Huawei NPU-native torch_npu.npu_rotary_mul for accelerated positional encoding in every transformer attention layer.

Description

npu_rope.py implements a hardware-accelerated replacement for the Rotary Position Embedding computation used in transformer attention layers. RoPE is applied to the query and key tensors in every attention layer, so it runs once per layer on every forward pass and is a natural target for a fused hardware kernel.
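
For orientation, the rotate_half formulation that this module replaces can be sketched in plain Python for a single head vector. This is a minimal illustration of the standard formula q * cos + rotate_half(q) * sin, not the code in npu_rope.py; the helper names apply_rope_reference and cos_sin_for_position, and the base of 10000, are assumptions made for the example.

```python
import math

def rotate_half(x):
    """Map [x1, x2] to [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def apply_rope_reference(x, cos, sin):
    """Element-wise x * cos + rotate_half(x) * sin for one head vector."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]

def cos_sin_for_position(pos, head_dim, base=10000.0):
    """Build cos/sin rows for one position (hypothetical helper).

    Element i and element i + head_dim // 2 share the same angle.
    """
    half = head_dim // 2
    freqs = [pos / (base ** (2 * i / head_dim)) for i in range(half)]
    angles = freqs + freqs
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]

q = [1.0, 2.0, 3.0, 4.0]
cos, sin = cos_sin_for_position(pos=3, head_dim=4)
q_embed = apply_rope_reference(q, cos, sin)
```

Each pair (x[i], x[i + half]) is rotated by the same angle, so the vector norm is unchanged. A fused kernel produces this result without materializing the rotate_half intermediate.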

The module provides:

  • _apply_rotary_pos_emb: The NPU-accelerated standard RoPE function that replaces the default rotate_half implementation. It uses the fused torch_npu.npu_rotary_mul operator to compute the rotary embedding for the query and key tensors, avoiding the intermediate rotate_half computation.
  • _apply_multimodal_rotary_pos_emb_qwen25_vl: A specialized variant for Qwen2.5-VL multimodal models that performs the multimodal RoPE section splitting (mrope_section) before applying npu_rotary_mul. This supports the 3D positional encoding (temporal, height, width) used in vision-language models.
  • NpuRoPEKernel: The registered kernel class (kernel_id: "npu_fused_rope") that applies the optimization by:
    • Iterating over all model modules to find attention layers (classes with "Attention" in their name).
    • Identifying the Python module where each attention class is defined.
    • Monkey-patching the apply_rotary_pos_emb function (and apply_multimodal_rotary_pos_emb for VL models) at the module level using setattr on sys.modules entries.
    • Deduplicating patches to avoid redundant replacements for shared module definitions.
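
The patching steps above can be sketched with the standard library alone. The stand-in module fake_modeling, the class FakeAttention, and the helper patch_rope are hypothetical names invented for this illustration; the real kernel would walk the model's modules rather than a plain list, and its exact code is not reproduced here.

```python
import sys
import types

# Stand-in for a modeling file that defines both an attention class and the
# module-level RoPE helper (hypothetical; used only for this sketch).
modeling = types.ModuleType("fake_modeling")
exec(
    """
def apply_rotary_pos_emb(q, k, cos, sin):
    return ("default", q, k)

class FakeAttention:
    def forward(self, q, k):
        # Resolved from module globals at call time, so a module-level
        # replacement is picked up by every instance.
        return apply_rotary_pos_emb(q, k, None, None)
""",
    modeling.__dict__,
)
sys.modules["fake_modeling"] = modeling

def fast_apply_rotary_pos_emb(q, k, cos, sin):
    # Placeholder for the accelerated function.
    return ("patched", q, k)

def patch_rope(layers, replacement):
    """Replace apply_rotary_pos_emb once per defining module."""
    patched = set()
    for layer in layers:
        cls = type(layer)
        if "Attention" not in cls.__name__:
            continue
        module_name = cls.__module__
        if module_name in patched:
            continue  # already handled via another layer of the same class
        module = sys.modules.get(module_name)
        if module is not None and hasattr(module, "apply_rotary_pos_emb"):
            setattr(module, "apply_rotary_pos_emb", replacement)
            patched.add(module_name)
    return patched

layers = [modeling.FakeAttention(), modeling.FakeAttention()]
patched_modules = patch_rope(layers, fast_apply_rotary_pos_emb)
result = layers[0].forward(1, 2)
```

Because forward looks the function up in its module's globals on each call, a single setattr on the module changes the behavior of every layer defined there, which is why one patch per module is enough.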

Usage

This kernel is automatically discovered and registered by the kernel interface. It is applied when running a RoPE-based transformer model on NPU hardware. Because the patch replaces the function in the defining module rather than on individual instances, every attention layer that uses the same module definition is optimized at once.

Code Reference

Source Location

Signature

def _apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1) -> tuple[Tensor, Tensor]

def _apply_multimodal_rotary_pos_emb_qwen25_vl(q, k, cos, sin, mrope_section, unsqueeze_dim=1) -> tuple[Tensor, Tensor]

@register_kernel
class NpuRoPEKernel(BaseKernel):
    _kernel_id = "npu_fused_rope"
    _device = DeviceType.NPU

    @classmethod
    def apply(cls, **kwargs) -> HFModel
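
The mrope_section argument in the multimodal signature above describes how the head dimension is divided among the temporal, height, and width axes. The sketch below shows the split-and-select step as it appears in the upstream Hugging Face Qwen2-VL reference, written with plain lists for one token. Whether npu_rope.py performs the step in exactly this way is an assumption, and select_mrope_sections is a hypothetical name.

```python
def select_mrope_sections(per_axis, mrope_section):
    """Merge three per-axis rows into one row, section by section.

    per_axis: three equal-length rows (temporal, height, width) of cos or sin
        values for a single token.
    mrope_section: channel counts per axis for one half of the head dimension.
    """
    # cos/sin rows cover both halves of the head dimension, so repeat the sections.
    sections = list(mrope_section) * 2
    assert sum(sections) == len(per_axis[0])
    merged, start = [], 0
    for i, width in enumerate(sections):
        # Section i takes its channels from axis i % 3.
        merged.extend(per_axis[i % 3][start:start + width])
        start += width
    return merged

temporal = ["t"] * 8
height = ["h"] * 8
width = ["w"] * 8
merged = select_mrope_sections([temporal, height, width], mrope_section=[2, 1, 1])
```

With mrope_section=[2, 1, 1] and a head dimension of 8, the merged row takes two temporal channels, one height channel, and one width channel, then repeats that pattern for the second half.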

Import

from llamafactory.v1.plugins.model_plugins.kernels.ops.rope.npu_rope import NpuRoPEKernel

I/O Contract

Inputs

NpuRoPEKernel.apply

Name | Type | Required | Description
model | HFModel (via kwargs) | Yes | The HuggingFace model instance; its attention modules will be inspected for RoPE patching

_apply_rotary_pos_emb

Name | Type | Required | Description
q | torch.Tensor | Yes | Query tensor from the attention layer
k | torch.Tensor | Yes | Key tensor from the attention layer
cos | torch.Tensor | Yes | Cosine component of the rotary embedding
sin | torch.Tensor | Yes | Sine component of the rotary embedding
position_ids | torch.Tensor | No | Position IDs (unused in NPU implementation, kept for API compatibility)
unsqueeze_dim | int | No | Dimension to unsqueeze cos/sin tensors (default: 1)

Outputs

NpuRoPEKernel.apply

Name | Type | Description
model | HFModel | The model with apply_rotary_pos_emb patched to use NPU-native npu_rotary_mul

_apply_rotary_pos_emb

Name | Type | Description
q_embed | torch.Tensor | Query tensor with rotary position embedding applied
k_embed | torch.Tensor | Key tensor with rotary position embedding applied

Usage Examples

# Application by kernel ID via the kernel interface
# (`model` is an already-loaded HFModel instance; loading is not shown)
from llamafactory.v1.plugins.model_plugins.kernels.interface import apply_kernel

apply_kernel("npu_fused_rope", model=model)

# Direct application
from llamafactory.v1.plugins.model_plugins.kernels.ops.rope.npu_rope import NpuRoPEKernel

NpuRoPEKernel.apply(model=model)
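
When calling the kernel directly, it can be useful to skip the patch on machines without the NPU backend. The following is a minimal sketch under that assumption; npu_backend_importable and maybe_apply_npu_rope are hypothetical helpers, not part of LLaMA Factory, and the check only tests whether the torch_npu package can be found, not whether an NPU device is present.

```python
import importlib.util

def npu_backend_importable() -> bool:
    """Return True when the torch_npu package can be located (hypothetical helper)."""
    return importlib.util.find_spec("torch_npu") is not None

def maybe_apply_npu_rope(model, apply_fn):
    """Call apply_fn(model=model) only when torch_npu is importable.

    apply_fn stands in for a callable such as NpuRoPEKernel.apply; the model
    is returned unchanged when the backend is missing.
    """
    if npu_backend_importable():
        return apply_fn(model=model)
    return model
```

Passing the callable in as an argument keeps the helper importable on machines where the llamafactory kernel modules or torch_npu are not installed.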
