Principle:NVIDIA TransformerEngine Rotary Position Embedding

Overview

Encoding position information into attention queries and keys using rotation matrices for relative position awareness.

Description

Rotary Position Embedding (RoPE) encodes position by rotating pairs of dimensions of the query and key vectors through angles that grow linearly with the token position. Attention scores then depend on the relative offset between tokens, and with the standard frequency schedule they tend to decay as that offset grows, without a separate position embedding being added to the token embeddings.

Rather than adding positional vectors to token embeddings (as in absolute position encoding), RoPE applies a rotation to each pair of dimensions in the query and key vectors. The rotation angle is a function of both the dimension index and the token's absolute position. Because rotations are orthogonal and their angles add under composition, the dot product between a rotated query and a rotated key depends only on the relative offset between their positions, not on the absolute positions themselves.

This approach has several advantages:

Relative position awareness emerges naturally from the mathematical structure of rotations, without needing to learn explicit relative position biases.
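Flexible sequence lengths are supported because the rotation can be computed for any position index, so the encoding is not tied to a fixed maximum length. Quality beyond the training sequence length is not guaranteed, which is the problem position interpolation addresses.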
No additional parameters are introduced; the rotation frequencies are deterministic functions of dimension index and a configurable base frequency.

Theoretical Basis

RoPE applies a rotation matrix $R (θ_{i} \cdot m)$ to pairs of dimensions in Q and K, where:

$θ_{i} = {base}^{- 2 i / d}$ is the frequency for dimension pair $i$
$m$ is the absolute position index
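$d$ is the dimension of the vectors being rotated (in multi-head attention, typically the per-head dimension), with $i = 0, \dots, d / 2 - 1$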

For a two-dimensional pair $(q_{2 i}, q_{2 i + 1})$, the rotation is:

$\begin{pmatrix} \cos(m θ_{i}) & -\sin(m θ_{i}) \\ \sin(m θ_{i}) & \cos(m θ_{i}) \end{pmatrix} \begin{pmatrix} q_{2 i} \\ q_{2 i + 1} \end{pmatrix}$
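
The pairwise rotation can be sketched in a few lines of NumPy. This is a minimal illustration of the formula as written here, pairing dimensions $(2 i, 2 i + 1)$; the function names `rope_frequencies` and `apply_rope` and the default base of 10000 are assumptions made for the example, not TransformerEngine's API, and a real implementation may pair dimensions differently.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """Return theta_i = base^(-2i/d) for i = 0 .. d/2 - 1 (hypothetical helper)."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d)

def apply_rope(x, positions, base=10000.0):
    """Rotate each (x[2i], x[2i+1]) pair by the angle m * theta_i.

    x:         array of shape (seq_len, d), d even
    positions: array of shape (seq_len,) holding the position index m of each row
    """
    d = x.shape[-1]
    theta = rope_frequencies(d, base)               # (d/2,)
    angles = positions[:, None] * theta[None, :]    # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin       # first row of the 2x2 rotation
    out[:, 1::2] = x_even * sin + x_odd * cos       # second row of the 2x2 rotation
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
q_rot = apply_rope(q, np.arange(4))
```

Because each pair is only rotated, the norm of every row is unchanged, and the row at position 0 is returned as-is.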

The key property is that the attention score between the query at position $m$ and the key at position $n$:

$Q_{m}^{T} K_{n} = f(q, k, m - n)$

depends on the positions only through the relative offset $(m - n)$, providing inherent relative position awareness.
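
The property follows from how rotations compose. A short derivation for a single two-dimensional pair, writing $R(\alpha)$ for the 2x2 rotation by angle $\alpha$ defined above and using $R(\alpha)^{T} = R(-\alpha)$:

```latex
\begin{aligned}
\big(R(m\,\theta_i)\, q\big)^{T} \big(R(n\,\theta_i)\, k\big)
  &= q^{T} R(m\,\theta_i)^{T} R(n\,\theta_i)\, k \\
  &= q^{T} R(-m\,\theta_i)\, R(n\,\theta_i)\, k \\
  &= q^{T} R\big((n - m)\,\theta_i\big)\, k
\end{aligned}
```

Summing this over all pairs $i$ gives a score that involves $m$ and $n$ only through their difference.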

Position interpolation extends the context length beyond training by scaling the position index:

m' = m / seq_len_interpolation_factor

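This compresses the position indices into the range seen during training, enabling longer sequences at inference time without retraining from scratch. In practice, a short fine-tuning pass at the extended length is commonly used to recover quality.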
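
As a rough sketch of the scaling, the snippet below builds the table of angles $m' θ_{i}$ with and without interpolation. Only the parameter name `seq_len_interpolation_factor` comes from the text above; the helper `rope_angles`, the training length of 2048 and the factor of 4 are illustrative assumptions.

```python
import numpy as np

def rope_angles(seq_len, d, base=10000.0, seq_len_interpolation_factor=None):
    """Return the (seq_len, d/2) table of angles m' * theta_i (hypothetical helper)."""
    positions = np.arange(seq_len, dtype=np.float64)
    if seq_len_interpolation_factor is not None:
        # m' = m / seq_len_interpolation_factor
        positions = positions / seq_len_interpolation_factor
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    return np.outer(positions, theta)

train_len, factor = 2048, 4   # assumed values for illustration
plain = rope_angles(train_len, d=64)
scaled = rope_angles(train_len * factor, d=64,
                     seq_len_interpolation_factor=factor)
```

With these values, the 8192 scaled positions all fall inside the range [0, 2048) covered by the unscaled table, and every fourth scaled row coincides with a row of the unscaled table.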

Usage

Use RoPE in Transformer models that require position-aware attention. It is standard in modern large language models, including:

LLaMA and its derivatives
Gemma
Mistral
Other modern decoder-only architectures

RoPE is applied to the query and key tensors before the attention score computation. The rotation frequencies are precomputed once for the maximum sequence length and reused across all layers and attention heads.
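
The precompute-then-reuse pattern might look like the following NumPy sketch. The helpers `build_rope_cache` and `rotate`, the `(seq_len, num_heads, head_dim)` layout and all sizes are assumptions chosen for the example and do not reflect TransformerEngine's actual interface.

```python
import numpy as np

def build_rope_cache(max_seq_len, head_dim, base=10000.0):
    """Precompute cos/sin tables of shape (max_seq_len, head_dim/2)."""
    theta = base ** (-2.0 * np.arange(head_dim // 2) / head_dim)
    angles = np.outer(np.arange(max_seq_len), theta)
    return np.cos(angles), np.sin(angles)

def rotate(x, cos, sin):
    """Apply RoPE to x of shape (seq_len, num_heads, head_dim)."""
    seq_len = x.shape[0]
    c = cos[:seq_len, None, :]   # slice to the current length, broadcast over heads
    s = sin[:seq_len, None, :]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * c - x_odd * s
    out[..., 1::2] = x_even * s + x_odd * c
    return out

# Built once; the same tables can serve every layer and head.
cos, sin = build_rope_cache(max_seq_len=128, head_dim=16)

rng = np.random.default_rng(0)
seq_len, num_heads, head_dim = 6, 2, 16
# Repeat one content vector at every position so that only position varies.
q = np.tile(rng.standard_normal((1, num_heads, head_dim)), (seq_len, 1, 1))
k = np.tile(rng.standard_normal((1, num_heads, head_dim)), (seq_len, 1, 1))

# Rotate q and k before computing attention scores.
q_rot, k_rot = rotate(q, cos, sin), rotate(k, cos, sin)
scores = np.einsum("mhd,nhd->hmn", q_rot, k_rot)   # (num_heads, seq_len, seq_len)
```

Since the content is identical at every position in this toy setup, each head's score matrix is constant along its diagonals, which is the relative-position property showing up in the scores.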
