Implementation:NVIDIA TransformerEngine TE RotaryPositionEmbedding
Overview
Concrete tool for computing rotary position embeddings provided by TransformerEngine.
Description
RotaryPositionEmbedding computes sinusoidal rotation frequencies and generates embedding tensors for a given max sequence length. It supports configurable rotary percentage, position interpolation, and interleaved layout.
The module precomputes inverse frequencies from the configured rotary_base and embedding dimension, then on each forward call generates a tensor of rotation angles (position index times inverse frequency) for the requested sequence length. TE's attention kernels take the cosine and sine of these angles to apply rotary position encoding to query and key vectors.
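The precomputation described above can be sketched in plain Python. This is a minimal illustration rather than TransformerEngine's code: the helper names `inverse_frequencies` and `angle_table` are hypothetical, and lists stand in for tensors so the sketch runs without PyTorch or a GPU.

```python
def inverse_frequencies(dim, rotary_base=10000.0):
    """Return dim/2 inverse frequencies: rotary_base^(-2i/dim), i = 0 .. dim/2 - 1."""
    return [rotary_base ** (-(2 * i) / dim) for i in range(dim // 2)]


def angle_table(max_seq_len, dim, rotary_base=10000.0, offset=0):
    """Return angles[pos][i] = (pos + offset) * inv_freq[i] (hypothetical helper)."""
    inv_freq = inverse_frequencies(dim, rotary_base)
    return [[(pos + offset) * f for f in inv_freq] for pos in range(max_seq_len)]


angles = angle_table(max_seq_len=4, dim=8)
print(len(angles), len(angles[0]))  # number of positions, number of angles per position
print(angles[1])                    # at position 1 the angles equal the inverse frequencies
```

Each position gets dim/2 distinct angles. Low indices rotate quickly with position and high indices rotate slowly, which is what gives the encoding its range of wavelengths.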
Key capabilities include:
- Partial rotation via rotary_percent: Only a fraction of the head dimensions receive rotary encoding; the rest are left unmodified.
- Position interpolation via seq_len_interpolation_factor: Scales position indices to extend context length beyond training length.
- Interleaved layout via interleaved: Supports both the standard paired layout [d0, d1, d2, d3, ...] and the interleaved layout used by some model architectures.
- Configurable base frequency via rotary_base: Adjusts the wavelength spectrum of the rotation frequencies.
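The partial-rotation and layout options can be made concrete with a short sketch. The helpers `rotary_dim` and `expand_angles` are hypothetical names, and two details are assumptions rather than statements from this page: that the rotated width is obtained by truncating `dim * rotary_percent` to an integer, and that the non-interleaved layout concatenates two copies of the angles while the interleaved layout repeats each angle in place.

```python
def rotary_dim(dim, rotary_percent=1.0):
    """Number of dimensions that receive rotation (assumed int() truncation)."""
    return int(dim * rotary_percent) if rotary_percent < 1.0 else dim


def expand_angles(half_angles, interleaved=False):
    """Duplicate dim/2 angles to full width in one of two layouts."""
    if interleaved:
        # Each angle sits next to its copy: [a0, a0, a1, a1, ...]
        return [a for a in half_angles for _ in range(2)]
    # Two copies are concatenated: [a0, a1, ..., a0, a1, ...]
    return list(half_angles) + list(half_angles)


print(rotary_dim(128, rotary_percent=0.5))
print(expand_angles(["a0", "a1"]))
print(expand_angles(["a0", "a1"], interleaved=True))
```

Under these assumptions the layout only changes which two dimensions form a rotation pair: index i with i + dim/2 in the concatenated form, or index 2i with 2i + 1 in the interleaved form.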
Source
transformer_engine/pytorch/attention/rope.py, class RotaryPositionEmbedding at L18-109
Import
from transformer_engine.pytorch.attention import RotaryPositionEmbedding
Signature
class RotaryPositionEmbedding(torch.nn.Module):
    def __init__(
        self,
        dim: int,
        rotary_percent: float = 1.0,
        seq_len_interpolation_factor: Optional[int] = None,
        pretrained_max_position_embeddings: Optional[int] = None,
        rotary_base: float = 10000.0,
        interleaved: bool = False,
    ):
    def forward(self, max_seq_len: int, offset: int = 0) -> torch.Tensor:
I/O
| Direction | Description |
|---|---|
| Input | Constructor: dim (int), the rotation dimension (typically head_dim); the remaining configuration parameters control rotary percentage, interpolation, base frequency, and layout. Forward: max_seq_len (int) and offset (int, default 0). |
| Output | torch.Tensor of shape [max_seq_len, 1, 1, dim] containing the rotation angles for each position, from which the cosine and sine components are computed. |
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| dim | int | required | Rotation dimension, typically equal to the attention head dimension. |
| rotary_percent | float | 1.0 | Fraction of dimensions to apply rotary encoding to. Values less than 1.0 leave remaining dimensions unrotated. |
| seq_len_interpolation_factor | Optional[int] | None | Factor for position interpolation to extend context length beyond training length. |
| pretrained_max_position_embeddings | Optional[int] | None | Maximum position embeddings from pretrained model, used with interpolation. |
| rotary_base | float | 10000.0 | Base for computing inverse frequencies (base^(-2i/d)). |
| interleaved | bool | False | Whether to use interleaved dimension layout for rotary pairs. |
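The effect of seq_len_interpolation_factor can be illustrated with a small sketch of linear position scaling. The function name `scaled_positions` is hypothetical, and the sketch shows only a fixed division of position indices by the factor; how pretrained_max_position_embeddings enters the library's computation is not modelled here.

```python
def scaled_positions(max_seq_len, seq_len_interpolation_factor=None, offset=0):
    """Return position indices, compressed by the interpolation factor if given."""
    positions = [float(p + offset) for p in range(max_seq_len)]
    if seq_len_interpolation_factor is not None:
        # Dividing by the factor squeezes a longer sequence into the
        # position range the model saw during training.
        positions = [p / seq_len_interpolation_factor for p in positions]
    return positions


plain = scaled_positions(4096)
squeezed = scaled_positions(4096, seq_len_interpolation_factor=2)
print(plain[-1], squeezed[-1])  # last position index before and after scaling
```

With a factor of 2, a 4096-token sequence occupies the same position range as a 2048-token sequence, at the cost of half-integer spacing between neighbouring tokens.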
Example Usage
import torch
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

# Create RoPE module for head_dim=128
rope = RotaryPositionEmbedding(dim=128, rotary_base=10000.0)

# Generate position embeddings for sequence length 2048
rotary_emb = rope(max_seq_len=2048)
# rotary_emb.shape: [2048, 1, 1, 128]

# With position interpolation for extended context
rope_extended = RotaryPositionEmbedding(
    dim=128,
    seq_len_interpolation_factor=2,
    pretrained_max_position_embeddings=2048,
)
rotary_emb_ext = rope_extended(max_seq_len=4096)
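To show what downstream code does with such a tensor, here is a dependency-free sketch of the rotation applied to a query and key vector. It is not TransformerEngine's kernel: `rotate_pairs` and `dot` are hypothetical helpers, and the sketch assumes the non-interleaved pairing of index i with i + dim/2.

```python
import math


def rotate_pairs(vec, pos, rotary_base=10000.0):
    """Rotate an even-length vector by position-dependent angles."""
    dim = len(vec)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = pos * rotary_base ** (-(2 * i) / dim)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        # 2D rotation of the pair (x1, x2) by angle theta
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.2]

# Same relative offset (3) at two different absolute positions
score_a = dot(rotate_pairs(q, 5), rotate_pairs(k, 2))
score_b = dot(rotate_pairs(q, 13), rotate_pairs(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because rotating both vectors leaves their dot product dependent only on the difference between the two positions, which is the property that makes rotary encoding a relative position scheme.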
Related
Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements