Implementation:NVIDIA TransformerEngine TE RotaryPositionEmbedding

Overview

A concrete tool, provided by TransformerEngine, for computing rotary position embeddings.

Description

RotaryPositionEmbedding computes sinusoidal rotation frequencies and generates the embedding tensor for a given maximum sequence length. It supports configurable rotary percentage, position interpolation, and interleaved layout.

The module precomputes inverse frequencies from the configured rotary_base and embedding dimension. On each forward call it then builds a tensor of rotation angles (position index times inverse frequency) for the requested sequence length. TE's attention kernels consume this tensor and take its cosine and sine when applying rotary position encoding to query and key vectors.
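The precompute-then-expand step described above can be sketched in a few lines of dependency-free Python. This is an illustration only, not TE's code: the helper names `inverse_frequencies` and `rotation_angles` are hypothetical, nested lists stand in for tensors, and the sketch assumes the non-interleaved layout, in which each row carries the per-pair angles twice. It also assumes that cosine and sine are taken later, at the point where the embedding is applied.

```python
def inverse_frequencies(dim, rotary_base=10000.0):
    """One inverse frequency per rotary pair: rotary_base ** (-2i / dim)."""
    return [rotary_base ** (-(2 * i) / dim) for i in range(dim // 2)]


def rotation_angles(max_seq_len, inv_freq, offset=0):
    """Outer product of position indices and inverse frequencies.

    Returns max_seq_len rows. Each row holds the per-pair angles twice
    (assumed non-interleaved layout), so its length is 2 * len(inv_freq).
    """
    rows = []
    for pos in range(offset, offset + max_seq_len):
        per_pair = [pos * f for f in inv_freq]
        rows.append(per_pair + per_pair)
    return rows


inv_freq = inverse_frequencies(dim=8)  # computed once, reused on every call
angles = rotation_angles(max_seq_len=4, inv_freq=inv_freq)
print(len(angles), len(angles[0]))  # 4 8
```

Splitting the work this way means only the cheap outer product is repeated when the requested sequence length changes.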

Key capabilities include:

Partial rotation via rotary_percent: Only a fraction of the head dimensions receive rotary encoding; the rest are left unmodified.
Position interpolation via seq_len_interpolation_factor: Scales position indices to extend context length beyond training length.
Interleaved layout via interleaved: Supports both the default layout, which pairs dimension i with dimension i + dim/2, and the interleaved layout, which pairs adjacent dimensions (2i, 2i+1) and is used by some model architectures.
Configurable base frequency via rotary_base: Adjusts the wavelength spectrum of the rotation frequencies.
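To make the four options above concrete, here is a minimal sketch that rotates a single head vector in plain Python. It is not TE's kernel. The function `apply_rope` and its `interpolation_factor` argument are hypothetical names chosen for illustration, and the sketch assumes that interpolation simply divides the position index by the factor. The exact conditions under which TE applies that scaling are not covered here.

```python
import math


def apply_rope(vec, pos, rotary_percent=1.0, rotary_base=10000.0,
               interleaved=False, interpolation_factor=None):
    """Rotate the leading rotary channels of one head vector (illustrative)."""
    dim = len(vec)
    # Partial rotation: only the first rot_dim channels are rotated.
    rot_dim = int(dim * rotary_percent) if rotary_percent < 1.0 else dim
    half = rot_dim // 2
    # Position interpolation (assumed form): shrink the position index.
    if interpolation_factor is not None:
        pos = pos / interpolation_factor
    out = list(vec)
    for i in range(half):
        # rotary_base sets how quickly the angle shrinks across pairs.
        theta = pos * rotary_base ** (-(2 * i) / rot_dim)
        # The layout decides which two channels share the angle theta.
        a, b = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
        c, s = math.cos(theta), math.sin(theta)
        out[a] = vec[a] * c - vec[b] * s
        out[b] = vec[b] * c + vec[a] * s
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
half_rotated = apply_rope(x, pos=3, rotary_percent=0.5)
print(half_rotated[4:] == x[4:])  # True: the last four channels are untouched
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved in both layouts. Only the choice of which channels form a pair differs.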

Source

transformer_engine/pytorch/attention/rope.py, class RotaryPositionEmbedding at L18-109

Import

from transformer_engine.pytorch.attention import RotaryPositionEmbedding

Signature

class RotaryPositionEmbedding(torch.nn.Module):
    def __init__(
        self,
        dim: int,
        rotary_percent: float = 1.0,
        seq_len_interpolation_factor: Optional[int] = None,
        pretrained_max_position_embeddings: Optional[int] = None,
        rotary_base: float = 10000.0,
        interleaved: bool = False,
    ):

    def forward(self, max_seq_len: int, offset: int = 0) -> torch.Tensor:

I/O

Direction	Description
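Input	Constructor: `dim` (int), the rotation dimension (typically `head_dim`), plus configuration parameters that control rotary percentage, interpolation, base frequency, and layout. Forward: `max_seq_len` (int) and an optional `offset` (int, default `0`).
Output	`torch.Tensor` of shape `[max_seq_len, 1, 1, dim]` containing the rotation angle for each position and rotary dimension; cosine and sine are taken when the embedding is applied. When `rotary_percent` is less than 1.0, the last dimension shrinks to the rotated fraction of `dim`.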

Key Parameters

Parameter	Type	Default	Description
`dim`	`int`	required	Rotation dimension, typically equal to the attention head dimension.
`rotary_percent`	`float`	`1.0`	Fraction of dimensions to apply rotary encoding to. Values less than 1.0 leave remaining dimensions unrotated.
`seq_len_interpolation_factor`	`Optional[int]`	`None`	Factor for position interpolation to extend context length beyond training length.
`pretrained_max_position_embeddings`	`Optional[int]`	`None`	Maximum position embeddings from pretrained model, used with interpolation.
`rotary_base`	`float`	`10000.0`	Base for computing inverse frequencies (`base^(-2i/d)`).
`interleaved`	`bool`	`False`	Whether to use interleaved dimension layout for rotary pairs.
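The `base^(-2i/d)` expression in the `rotary_base` row can be written out as follows. The symbols p (position index), φ (rotation angle), and λ (wavelength) are notation introduced here for illustration and do not appear in the TE source.

```latex
\theta_i = \text{rotary\_base}^{-2i/d}, \quad i = 0, 1, \dots, \tfrac{d}{2} - 1
\qquad
\phi_{p,i} = p \, \theta_i
\qquad
\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \cdot \text{rotary\_base}^{2i/d}
```

With the default base of 10000, the wavelengths run from 2π for the first pair up to a value approaching 2π · 10000 for the last pair. Raising `rotary_base` stretches this spectrum toward longer wavelengths.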

Example Usage

import torch
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

# Create RoPE module for head_dim=128
rope = RotaryPositionEmbedding(dim=128, rotary_base=10000.0)

# Generate position embeddings for sequence length 2048
rotary_emb = rope(max_seq_len=2048)
# rotary_emb.shape: [2048, 1, 1, 128]

# With position interpolation for extended context
rope_extended = RotaryPositionEmbedding(
    dim=128,
    seq_len_interpolation_factor=2,
    pretrained_max_position_embeddings=2048,
)
rotary_emb_ext = rope_extended(max_seq_len=4096)
