Principle:Hiyouga LLaMA Factory Rotary Position Embedding

From Leeroopedia


Knowledge Sources
Domains: Deep Learning, Positional Encoding
Last Updated: 2026-02-06 19:00 GMT

Overview

Rotary Position Embedding (RoPE) encodes positional information by rotating query and key vectors in the attention mechanism, enabling relative position awareness without additional learnable parameters.

Description

Traditional absolute position embeddings add position-dependent vectors to token representations, which limits generalization to sequence lengths not seen during training. RoPE instead encodes position information by applying a rotation matrix to query and key vectors in each attention head. The rotation angle is a function of both the position index and the embedding dimension index, creating a position-dependent transformation that naturally captures relative distances through the inner product of rotated vectors.

In LLaMA-Factory, RoPE is foundational because all supported LLaMA-family models (and many other architectures) use it as their primary positional encoding. The framework provides:

  • RoPE scaling configuration that adjusts the effective context length by modifying the rotation frequencies. Supported scaling strategies include:
    • Linear scaling: Divides frequencies by a constant factor, linearly extending the context window.
    • Dynamic NTK scaling: Adjusts the base frequency dynamically, providing better extrapolation for long contexts.
    • YaRN scaling: Combines NTK-aware interpolation with attention scaling for improved long-context performance.
    • Llama3 scaling: Uses frequency-band-specific scaling with configurable low/high frequency factors.
  • NPU-accelerated RoPE via torch_npu.npu_rotary_mul, which fuses the rotation computation into a single kernel call on Ascend NPU hardware.

Usage

RoPE configuration is applied when:

  • Training or fine-tuning a model on sequences longer than its pretrained context window (e.g., extending a 4K model to 16K tokens).
  • Deploying models for inference with extended context requirements.
  • Running on NPU hardware where the fused RoPE kernel provides significant speedup over the standard PyTorch implementation.

The scaling factor is computed as ceil(model_max_length / original_max_position_embeddings) and is only applied when the desired length exceeds the model's original maximum.
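
As a worked example of the rule above, the following sketch computes the factor for a 4K model extended to 16K tokens. The helper name `compute_rope_factor` is hypothetical and not part of LLaMA-Factory; it only mirrors the ceil-based rule described here.

```python
import math
from typing import Optional


def compute_rope_factor(model_max_length: int, original_max_length: int) -> Optional[float]:
    """Hypothetical helper: return the RoPE scaling factor, or None if no scaling applies."""
    if model_max_length <= original_max_length:
        # The desired length already fits in the pretrained window: leave RoPE untouched.
        return None
    # Round the length ratio up to the next whole number.
    return float(math.ceil(model_max_length / original_max_length))


print(compute_rope_factor(16384, 4096))  # 4.0
print(compute_rope_factor(6000, 4096))   # 2.0 (ratio of about 1.46, rounded up)
print(compute_rope_factor(2048, 4096))   # None
```

Because of the rounding, any target length between 4097 and 8192 tokens yields the same factor of 2.0 for a 4K model.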

Theoretical Basis

RoPE applies a position-dependent rotation to each pair of dimensions in the query and key vectors. For a d-dimensional embedding at position m, the rotation is defined by:

$$f(x_m, m) = R_{\Theta,m}\, x_m$$

where RΘ,m is a block-diagonal rotation matrix composed of d/2 two-dimensional rotations:

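$$
R_{\Theta,m} = \begin{pmatrix}
\cos m\theta_0 & -\sin m\theta_0 & \cdots & 0 & 0 \\
\sin m\theta_0 & \cos m\theta_0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\
0 & 0 & \cdots & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1}
\end{pmatrix}
$$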

The rotation frequencies are defined as:

$$\theta_i = 10000^{-2i/d}, \quad i = 0, 1, \ldots, d/2 - 1$$
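
The frequency schedule can be tabulated with a few lines of Python. This is a minimal sketch using only the standard library; the function name `rope_frequencies` and the small dimension d = 8 are illustrative choices, not taken from LLaMA-Factory.

```python
import math
from typing import List


def rope_frequencies(d: int, base: float = 10000.0) -> List[float]:
    """Return theta_i = base ** (-2 * i / d) for i = 0, ..., d/2 - 1."""
    if d % 2 != 0:
        raise ValueError("embedding dimension d must be even")
    return [base ** (-2.0 * i / d) for i in range(d // 2)]


thetas = rope_frequencies(8)
for i, theta in enumerate(thetas):
    # Wavelength: how many positions it takes pair i to complete a full 2*pi rotation.
    print(i, theta, 2 * math.pi / theta)
```

The first pair rotates fastest (θ = 1 radian per position), and each later pair rotates more slowly, so low-index pairs resolve nearby positions while high-index pairs distinguish distant ones.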

The key property of RoPE is that the inner product of two rotated vectors depends only on their relative position:

$$\langle R_{\Theta,m}\, q,\; R_{\Theta,n}\, k \rangle = \langle R_{\Theta,m-n}\, q,\; k \rangle$$

This naturally encodes relative position information into the attention scores without additional parameters.
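
The relative-position property can be checked numerically. The sketch below applies the block-diagonal rotation to adjacent pairs of a plain Python list; `apply_rope`, `dot`, and the sample vectors are illustrative assumptions and do not reproduce any particular library's tensor layout.

```python
import math
from typing import List


def apply_rope(x: List[float], m: float, base: float = 10000.0) -> List[float]:
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    out: List[float] = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        # 2x2 rotation: [[c, -s], [s, c]] applied to (a, b).
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.8, 0.1]

score_a = dot(apply_rope(q, 10), apply_rope(k, 7))     # positions m = 10, n = 7
score_b = dot(apply_rope(q, 103), apply_rope(k, 100))  # different positions, same offset m - n = 3
score_c = dot(apply_rope(q, 3), k)                     # <R_{m-n} q, k>
print(score_a, score_b, score_c)  # equal up to floating-point rounding
```

All three scores agree because only the offset m − n survives in the inner product; the absolute positions cancel.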

Context length extension works by modifying the frequency schedule. For linear scaling with factor α:

$$\theta_i' = \theta_i / \alpha$$

This effectively reduces the rotation speed, allowing more positions before the pattern repeats. The scaling factor in LLaMA-Factory is calculated as:

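import math
# Round the length ratio up to a whole-number factor, then raise the position limit to match.
rope_factor = float(math.ceil(model_max_length / old_max_length))
setattr(config, "max_position_embeddings", old_max_length * rope_factor)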
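
To illustrate the effect of linear scaling, the sketch below shows that dividing the frequencies by α = 4 makes position 16000 produce the same rotation angles that position 4000 produces without scaling. The helper `rope_angles` and the chosen numbers are illustrative assumptions.

```python
import math
from typing import List


def rope_angles(m: float, d: int, alpha: float = 1.0, base: float = 10000.0) -> List[float]:
    """Angles m * theta_i / alpha for each pair; alpha = 1 means no scaling."""
    return [m * base ** (-2.0 * i / d) / alpha for i in range(d // 2)]


alpha = 4.0
scaled = rope_angles(16000, d=8, alpha=alpha)  # position 16000 with linear scaling
unscaled = rope_angles(4000, d=8)              # position 4000 without scaling
same = all(math.isclose(a, b) for a, b in zip(scaled, unscaled))
print(same)  # True
```

In other words, linear scaling compresses the extended position range back into the range of angles the model saw during pretraining, at the cost of a finer spacing between neighbouring positions.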

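Dynamic NTK scaling adjusts the base (10000 in the frequency definition above) instead of dividing every frequency by the same factor. Raising the base leaves the high-frequency dimensions nearly unchanged and stretches mainly the low-frequency ones, and the adjustment is recomputed as the sequence length grows. YaRN refines this further by applying different scaling factors to different frequency bands and adding an attention temperature correction.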
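
The difference between base adjustment and linear scaling can be seen in a small sketch. It assumes the commonly used NTK-aware rule base' = base · α^(d/(d−2)), which this page does not state; the function name `ntk_scaled_frequencies` is hypothetical, and the dynamic variant would additionally recompute α from the current sequence length.

```python
from typing import List


def ntk_scaled_frequencies(d: int, alpha: float, base: float = 10000.0) -> List[float]:
    """Frequencies after replacing base with base * alpha ** (d / (d - 2)) (assumed rule)."""
    new_base = base * alpha ** (d / (d - 2))
    return [new_base ** (-2.0 * i / d) for i in range(d // 2)]


d, alpha = 8, 4.0
original = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]
ntk = ntk_scaled_frequencies(d, alpha)

# Ratio of adjusted to original frequency for each pair.
ratios = [n / o for n, o in zip(ntk, original)]
print(ratios)  # starts at 1 for the fastest pair and falls to about 1/alpha for the slowest
```

Under this rule the fastest-rotating pair is untouched and only the slowest pair is slowed by the full factor α, whereas linear scaling would divide every entry by α.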

The NPU-optimized implementation replaces the element-wise rotation computation with a single fused kernel:

# Standard implementation (multiple ops)
# q_embed = q * cos + rotate_half(q) * sin

# NPU fused implementation (single kernel)
q_embed = torch_npu.npu_rotary_mul(q, cos, sin)
k_embed = torch_npu.npu_rotary_mul(k, cos, sin)

This fusion avoids intermediate tensor allocations and reduces memory bandwidth overhead, which is particularly beneficial because the rotation is applied in every attention layer.
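
For reference, the standard form `q * cos + rotate_half(q) * sin` can be spelled out in plain Python. This page does not define `rotate_half`; the sketch assumes the common Hugging Face convention of splitting the vector into two halves, and it makes no claim about the tensor layout expected by torch_npu.npu_rotary_mul. The name `apply_rope_standard` is hypothetical.

```python
import math
from typing import List


def rotate_half(x: List[float]) -> List[float]:
    """Map (x1, x2) -> (-x2, x1), where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope_standard(q: List[float], m: float, base: float = 10000.0) -> List[float]:
    """Unfused form: q * cos + rotate_half(q) * sin, computed element-wise."""
    d = len(q)
    angles = [m * base ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [math.cos(a) for a in angles] * 2  # repeated so both halves share the angles
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(q)
    # Two multiplies and one add; in an unfused tensor implementation each
    # step typically materializes an intermediate result.
    return [qi * c + ri * s for qi, c, ri, s in zip(q, cos, rot, sin)]


print(apply_rope_standard([0.3, -1.2, 0.7, 2.0], m=5))
```

Under this convention dimension i is paired with dimension i + d/2 rather than with its neighbour, which matches the block-diagonal form up to a fixed permutation of the dimensions.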
