Principle:NVIDIA TransformerEngine Rotary Position Embedding
Overview
Encoding position information into attention queries and keys using rotation matrices for relative position awareness.
Description
Rotary Position Embedding (RoPE) encodes position by rotating query and key vectors in pairs of dimensions, with rotation angles given by sinusoidal functions of the token position. As a result, attention scores depend on the relative distance between tokens and tend to decay as that distance grows, without a separate position embedding being added to the token embeddings.
Rather than adding positional vectors to token embeddings (as in absolute position encoding), RoPE applies a rotation to each pair of dimensions in the query and key vectors. The rotation angle is a function of both the dimension index and the token's absolute position. Because rotations compose, rotating the query by the angle for position m and the key by the angle for position n is equivalent, inside the dot product, to a single rotation by the difference of the two angles. The dot product between a rotated query and a rotated key therefore depends on the positions only through their relative offset, not on their absolute values.
This approach has several advantages:
- Relative position awareness emerges naturally from the mathematical structure of rotations, without needing to learn explicit relative position biases.
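- Flexible sequence lengths are supported because the rotation can be computed for any position index, including positions beyond the training sequence length. Model quality at unseen positions is not guaranteed, which is why position interpolation (see Theoretical Basis) is often used for longer contexts.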
- No additional parameters are introduced; the rotation frequencies are deterministic functions of dimension index and a configurable base frequency.
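The parameter-free nature of the frequencies can be illustrated with a minimal sketch. It assumes the commonly used schedule theta_i = base^(-2i/d) with a default base of 10000; the function name `rope_frequencies` is hypothetical and not part of any library API.

```python
def rope_frequencies(dim, base=10000.0):
    """Return theta_i = base ** (-2i / dim) for each dimension pair i.

    Assumption: this is the commonly used RoPE frequency schedule; the
    function name and default base are illustrative only.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even so that dimensions can be paired")
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


# One frequency per pair of dimensions; nothing here is learned.
freqs = rope_frequencies(8)
```

The first pair rotates fastest (theta_0 = 1) and later pairs rotate progressively more slowly, so different pairs resolve position at different scales.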
Theoretical Basis
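RoPE applies a rotation matrix R(theta_i, m) to pairs of dimensions in Q and K, where:
- theta_i = base^(-2i/d) is the frequency for dimension pair i (i = 0, ..., d/2 - 1), with base commonly set to 10000
- m is the absolute position index
- d is the embedding dimension
For a two-dimensional pair (q_2i, q_2i+1), the rotation is: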
[cos(m * theta_i) -sin(m * theta_i)] [q_2i ]
[sin(m * theta_i) cos(m * theta_i)] [q_2i+1 ]
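The 2x2 rotation above can be applied pair by pair across a whole vector. The following is a sketch under stated assumptions, not the TransformerEngine implementation: it uses the interleaved pairing (q_2i, q_2i+1) shown in the matrix, the commonly used frequency theta_i = base^(-2i/d), and a hypothetical helper name `apply_rope`.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * theta_i.

    Assumptions: interleaved pairing and theta_i = base ** (-2i / d);
    the helper name is illustrative only.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = [0.0] * dim
    for i in range(dim // 2):
        angle = position * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = c * x - s * y      # first row of the rotation matrix
        out[2 * i + 1] = s * x + c * y  # second row of the rotation matrix
    return out


q = [1.0, 0.0, 1.0, 0.0]
q_rotated = apply_rope(q, position=3)
```

Because each pair is only rotated, the vector's length is unchanged; position 0 leaves the vector as it was.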
The key property is that the attention score between a query at position m and a key at position n:
Q_m^T * K_n = f(q, k, m - n)
depends on the positions only through the relative distance m - n, providing inherent relative position awareness.
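The relative-position property can be checked numerically. This sketch makes the same assumptions as the rotation formula above (interleaved pairs, frequency base^(-2i/d)); `apply_rope` and `dot` are hypothetical helpers, and the vectors are arbitrary illustrative values.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by position * base ** (-2i / d)."""
    dim = len(vec)
    out = [0.0] * dim
    for i in range(dim // 2):
        angle = position * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = c * x - s * y
        out[2 * i + 1] = s * x + c * y
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at very different absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))
score_far = dot(apply_rope(q, 105), apply_rope(k, 102))

# A different offset (1) from the same query position.
score_other = dot(apply_rope(q, 5), apply_rope(k, 4))
```

The first two scores agree up to floating-point error because both pairs of positions are three apart, while the third differs because the offset changed.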
Position interpolation extends the context length beyond training by scaling the position index:
m' = m / seq_len_interpolation_factor
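This compresses the position indices into the range seen during training, so longer sequences can be processed at inference time without training from scratch at the longer length. A short fine-tuning pass at the extended length is often still used to recover quality.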
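As a rough illustration of the scaling, the sketch below divides the position by `seq_len_interpolation_factor` before computing the rotation angles. The function name `rope_angles`, the frequency schedule base^(-2i/d), and the sequence lengths are assumptions chosen for the example.

```python
def rope_angles(position, dim, base=10000.0, seq_len_interpolation_factor=None):
    """Return the rotation angle of each dimension pair at one position.

    Assumption: theta_i = base ** (-2i / dim); the function name is
    illustrative only.
    """
    if seq_len_interpolation_factor is not None:
        # m' = m / seq_len_interpolation_factor
        position = position / seq_len_interpolation_factor
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


# Hypothetical setup: trained on 2048 positions, run on 8192.
train_len, target_len = 2048, 8192
factor = target_len / train_len

# The last position of the long sequence maps back inside the trained range.
scaled_last = (target_len - 1) / factor

# Position 8 with factor 4 produces the same angles as position 2 unscaled.
interpolated = rope_angles(8, dim=8, seq_len_interpolation_factor=4.0)
reference = rope_angles(2, dim=8)
```

The trade-off is that neighboring tokens become closer together in angle, which is one reason the factor is usually kept no larger than needed.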
Usage
Use RoPE in Transformer models that require position-aware attention. RoPE is standard in modern large language models, including:
- LLaMA and its derivatives
- Gemma
- Mistral
- Other modern decoder-only architectures
RoPE is applied to the query and key tensors before the attention score computation. The rotation frequencies are precomputed once for the maximum sequence length and reused across all layers and attention heads.
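The precompute-and-reuse pattern might look like the following sketch. It is not the TransformerEngine API: `build_rope_cache` and `rotate_with_cache` are hypothetical names, and the interleaved pairing and the frequency schedule base^(-2i/d) are assumptions.

```python
import math


def build_rope_cache(max_seq_len, dim, base=10000.0):
    """Precompute cos and sin of position * theta_i for all positions and pairs."""
    freqs = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(m * f) for f in freqs] for m in range(max_seq_len)]
    sin = [[math.sin(m * f) for f in freqs] for m in range(max_seq_len)]
    return cos, sin


def rotate_with_cache(vec, position, cos, sin):
    """Rotate interleaved pairs of vec using the cached table rows."""
    out = list(vec)
    for i in range(len(vec) // 2):
        c, s = cos[position][i], sin[position][i]
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = c * x - s * y
        out[2 * i + 1] = s * x + c * y
    return out


# Built once, then shared by queries and keys in every layer and head.
cos, sin = build_rope_cache(max_seq_len=16, dim=4)
q_rot = rotate_with_cache([1.0, 0.0, 0.0, 1.0], 3, cos, sin)
k_rot = rotate_with_cache([0.5, 0.5, 0.5, 0.5], 7, cos, sin)
```

Caching the tables avoids recomputing trigonometric functions in every layer; the cost is memory proportional to the maximum sequence length times half the head dimension.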