Implementation:NVIDIA TransformerEngine TE RotaryPositionEmbedding

From Leeroopedia
Revision as of 16:00, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/NVIDIA_TransformerEngine_TE_RotaryPositionEmbedding.md)


Overview

RotaryPositionEmbedding is the concrete TransformerEngine (TE) module for computing rotary position embeddings (RoPE).

Description

RotaryPositionEmbedding computes the rotation frequencies used by rotary position embedding and returns the per-position embedding tensor for a requested sequence length. It supports a configurable rotary percentage, position interpolation, and an interleaved layout.

The module precomputes inverse frequencies from the configured rotary_base and the embedding dim. On each forward call it multiplies them by the position indices (shifted by offset) to build the rotation-angle tensor for the requested sequence length. TE's attention code consumes this tensor, taking the cosine and sine of the angles, to apply rotary position encoding to query and key vectors.
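The two steps above can be sketched in plain Python, with no TransformerEngine or GPU required. This illustrates the standard RoPE arithmetic rather than the module's actual code; the helper names `inverse_frequencies` and `angle_table` are hypothetical.

```python
def inverse_frequencies(dim: int, base: float = 10000.0) -> list[float]:
    """One inverse frequency per rotated pair: base^(-2i/dim), i = 0 .. dim/2 - 1."""
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]


def angle_table(max_seq_len: int, inv_freq: list[float], offset: int = 0) -> list[list[float]]:
    """Rotation angle for every (position, pair): (position + offset) * inv_freq[i]."""
    return [[(pos + offset) * f for f in inv_freq] for pos in range(max_seq_len)]


# Step 1 happens once (at construction); step 2 happens per forward call.
inv_freq = inverse_frequencies(dim=8)
angles = angle_table(max_seq_len=4, inv_freq=inv_freq)
```

Position 0 always maps to zero angles, so the first token is left unrotated unless a non-zero offset is passed.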

Key capabilities include:

  • Partial rotation via rotary_percent: Only a fraction of the head dimensions receive rotary encoding; the rest are left unmodified.
  • Position interpolation via seq_len_interpolation_factor: Scales position indices to extend context length beyond training length.
  • Interleaved layout via interleaved: Supports both the default non-interleaved layout, which pairs dimension i with dimension i + dim/2, and the interleaved layout used by some model architectures, which pairs adjacent dimensions (d0, d1), (d2, d3), ...
  • Configurable base frequency via rotary_base: Adjusts the wavelength spectrum of the rotation frequencies.
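The first three capabilities can be illustrated with a few lines of plain Python. This is a sketch under assumptions: the function names are hypothetical, and the behaviour shown (truncating the rotated width, dividing positions by the factor, tiling versus repeating angles) follows common RoPE implementations rather than being copied from the TE source.

```python
def rotated_dim(dim: int, rotary_percent: float = 1.0) -> int:
    """Number of head dimensions that receive rotation (assumed: int(dim * percent))."""
    return int(dim * rotary_percent)


def interpolate_positions(positions: list[int], factor: float) -> list[float]:
    """Position interpolation: compress indices so a longer sequence
    maps back into the position range seen during training."""
    return [p / factor for p in positions]


def layout_angles(freqs: list[float], interleaved: bool = False) -> list[float]:
    """Expand one angle per rotated pair into one angle per dimension."""
    if interleaved:
        # Adjacent dimensions form a pair: (d0, d1), (d2, d3), ...
        return [a for f in freqs for a in (f, f)]
    # Dimension i pairs with dimension i + dim/2: tile the angles as two halves.
    return list(freqs) + list(freqs)
```

For example, with two pair angles `[a, b]` the non-interleaved layout yields `[a, b, a, b]` and the interleaved layout yields `[a, a, b, b]`.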

Source

transformer_engine/pytorch/attention/rope.py, class RotaryPositionEmbedding at L18-109

Import

from transformer_engine.pytorch.attention import RotaryPositionEmbedding

Signature

class RotaryPositionEmbedding(torch.nn.Module):
    def __init__(
        self,
        dim: int,
        rotary_percent: float = 1.0,
        seq_len_interpolation_factor: Optional[int] = None,
        pretrained_max_position_embeddings: Optional[int] = None,
        rotary_base: float = 10000.0,
        interleaved: bool = False,
    ):

    def forward(self, max_seq_len: int, offset: int = 0) -> torch.Tensor:

I/O

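| Direction | Description |
|---|---|
| Input | Constructor: dim (int), the rotation dimension (typically head_dim), plus configuration parameters that control rotary percentage, interpolation, base frequency, and layout. forward: max_seq_len (int) and offset (int, default 0). |
| Output | torch.Tensor of shape [max_seq_len, 1, 1, dim] containing the rotation angle for each position and dimension; the cosine and sine of these angles are taken when the embedding is applied. |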
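To show what a consumer does with these values, here is a minimal plain-Python sketch of rotating a single query or key vector in the non-interleaved layout. The function `apply_rotary` is hypothetical and takes one angle per rotated pair; it is not the TE kernel, which operates on whole tensors.

```python
import math


def apply_rotary(vec: list[float], angles: list[float]) -> list[float]:
    """Rotate each pair (vec[i], vec[i + half]) by angles[i]."""
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        c, s = math.cos(angles[i]), math.sin(angles[i])
        x, y = vec[i], vec[i + half]
        # Standard 2-D rotation of the pair (x, y) by the angle.
        out[i] = x * c - y * s
        out[i + half] = x * s + y * c
    return out


# Pair 0 is rotated by a quarter turn; pair 1 has angle 0 and is unchanged.
q = [1.0, 0.0, 0.0, 1.0]
q_rotated = apply_rotary(q, [math.pi / 2, 0.0])
```

Because each pair undergoes a pure rotation, the vector's norm is preserved; only the relative phase between positions changes.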

Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| dim | int | required | Rotation dimension, typically equal to the attention head dimension. |
| rotary_percent | float | 1.0 | Fraction of dimensions to apply rotary encoding to. Values less than 1.0 leave remaining dimensions unrotated. |
| seq_len_interpolation_factor | Optional[int] | None | Factor for position interpolation to extend context length beyond training length. |
| pretrained_max_position_embeddings | Optional[int] | None | Maximum position embeddings from pretrained model, used with interpolation. |
| rotary_base | float | 10000.0 | Base for computing inverse frequencies (base^(-2i/d)). |
| interleaved | bool | False | Whether to use interleaved dimension layout for rotary pairs. |
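The base^(-2i/d) expression for rotary_base can be written out as follows. The symbols are notation chosen here for illustration, not names from the source: b for rotary_base, d for dim, p for a position index, and i for the index of a rotated pair.

```latex
\theta_i = b^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1
\qquad\text{(inverse frequency of pair } i\text{)}

\phi_{p,i} = p \,\theta_i
\qquad\text{(rotation angle at position } p\text{)}

\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \, b^{2i/d}
\qquad\text{(wavelength in positions)}
```

With the defaults b = 10000 and d = 128, the wavelengths range from 2π ≈ 6.28 positions for i = 0 up to roughly 5.4 × 10^4 positions for i = 63. Raising rotary_base stretches the long-wavelength end of this spectrum.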

Example Usage

import torch
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

# Create RoPE module for head_dim=128
rope = RotaryPositionEmbedding(dim=128, rotary_base=10000.0)

# Generate position embeddings for sequence length 2048
rotary_emb = rope(max_seq_len=2048)
# rotary_emb.shape: [2048, 1, 1, 128]

# With position interpolation for extended context
rope_extended = RotaryPositionEmbedding(
    dim=128,
    seq_len_interpolation_factor=2,
    pretrained_max_position_embeddings=2048,
)
rotary_emb_ext = rope_extended(max_seq_len=4096)

Related

Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
