
Heuristic:Lucidrains X transformers Rotary Position Embedding Selection

From Leeroopedia






Knowledge Sources
Domains: Deep_Learning, LLMs
Last Updated: 2026-02-08 18:00 GMT

Overview

Guide for selecting and configuring positional embeddings in x-transformers, with rotary embeddings as the strongly recommended default.

Description

x-transformers supports multiple positional embedding strategies: rotary (RoPE), ALiBi, absolute learned embeddings, dynamic position bias, T5 relative bias, CoPE, and polar embeddings. The README and code strongly recommend rotary embeddings for most use cases. This heuristic captures the selection criteria and configuration details.

Usage

Use this heuristic when configuring a new model and choosing between positional embedding strategies, or when experiencing issues with sequence length generalization or training stability.

The Insight (Rule of Thumb)

  • Action: Set `rotary_pos_emb=True` on Decoder/Encoder attention layers for ordered sequences.
  • Value: Boolean flag with optional dimension tuning via `rotary_emb_dim`.
  • Trade-off: Few drawbacks for ordered data: rotary embeddings are compatible with flash attention, KV caching, and sequence-length extrapolation, making them the most versatile choice.
  • Minimum dimension: `rotary_emb_dim` should be >= 32 for language models (warned in code).
  • NTK-Aware scaling: For sequence length extrapolation without fine-tuning, use `rotary_base_rescale_factor` > 1.0.
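
A minimal configuration sketch following the rule of thumb above. The `TransformerWrapper`/`Decoder` API and the `rotary_pos_emb`, `rotary_emb_dim`, and `rotary_base_rescale_factor` flags come from x-transformers; the numeric values are illustrative assumptions, not recommended settings:

```python
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,        # vocabulary size (illustrative)
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True,               # recommended default
        # rotary_emb_dim = 32,               # optional; keep >= 32 for language models
        # rotary_base_rescale_factor = 2.,   # optional NTK-aware length extrapolation
    )
)
```

This is a configuration fragment, not a training recipe; only the rotary-related flags are the point here.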

Comparison Table

| Method                | Flash Compatible | Length Extrapolation        | KV Cache Compatible      | Notes                                  |
|-----------------------|------------------|-----------------------------|--------------------------|----------------------------------------|
| Rotary (RoPE)         | Yes              | Yes (with NTK scaling)      | Yes                      | Recommended default                    |
| ALiBi                 | Yes              | Limited (strong local bias) | Yes                      | May hinder attending beyond ~1k tokens |
| Absolute Learned      | Yes              | No                          | No (outside max_seq_len) | Legacy; PaLM trend is to forgo these   |
| T5 Relative Bias      | No               | Yes                         | Yes                      | Incompatible with flash attention      |
| Dynamic Position Bias | No               | Yes                         | Yes                      | Incompatible with flash attention      |
| CoPE                  | No               | Yes                         | Yes                      | Incompatible with flash attention      |

Reasoning

Rotary embeddings encode position through rotation of query/key vectors, preserving relative position information in the dot product without requiring an additive bias matrix. This means they work within the fused flash attention kernel (no attention matrix manipulation needed). The NTK-aware scaling trick (discovered by Reddit user bloc97) allows extending to longer sequences without retraining by rescaling the RoPE base frequency.
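
The relative-position property can be checked directly: rotating q and k by their absolute positions leaves a dot product that depends only on the offset between them. Below is a minimal pure-Python sketch of the rotation (not the library's implementation; `rope_rotate` is an illustrative helper name):

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (x, y) pairs of a vector by position-dependent angles,
    as in RoPE: pair i spins at frequency base**(-i/dim)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)           # per-pair rotation frequency
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# same offset (4), different absolute positions: the scores match,
# because R(m)q . R(n)k = q . R(n - m)k for 2D rotations
s1 = dot(rope_rotate(q, 7), rope_rotate(k, 3))
s2 = dot(rope_rotate(q, 104), rope_rotate(k, 100))
```

Because no additive bias matrix is materialized, the same rotation can be applied to q and k before the fused flash attention kernel runs.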

ALiBi provides strong local attention bias but reports suggest it may hinder attending at distances greater than 1k tokens. Absolute positional embeddings cannot cache KV pairs outside the trained `max_seq_len`, making them impractical for variable-length inference.

Code Evidence

Minimum dimension warning from `x_transformers.py:2355-2356`:

if verbose and rotary_emb_dim < 32:
    logger.warning('when training language model, rotary embedding dimension should be at least 32')

NTK-Aware base rescaling from `x_transformers.py:713-716`:

# proposed by reddit user bloc97, to rescale rotary embeddings
# to longer sequence length without fine-tuning
# has some connection to NTK literature
base *= base_rescale_factor ** (dim / (dim - 2))
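
The effect of that one line can be sketched numerically: inflating the base stretches every rotation wavelength, so longer positions stay within the angle range seen during training. `rope_base` and `longest_wavelength` below are illustrative helper names, not library API:

```python
import math

def rope_base(dim, base=10000.0, rescale_factor=1.0):
    # NTK-aware rescaling, mirroring the line quoted above:
    # base *= base_rescale_factor ** (dim / (dim - 2))
    return base * rescale_factor ** (dim / (dim - 2))

def longest_wavelength(b, dim):
    # wavelength of the slowest-rotating pair, frequency b**(-(dim-2)/dim)
    return 2 * math.pi / b ** (-(dim - 2) / dim)

dim = 64
base_1x = rope_base(dim)                      # unchanged at factor 1.0
base_2x = rope_base(dim, rescale_factor=2.0)  # inflated base for extrapolation
```

A larger `rescale_factor` yields a larger effective base, hence longer wavelengths, without touching the model weights.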

Autocast disabled for positional precision from `x_transformers.py:739`:

@autocast('cuda', enabled = False)
def forward(self, t, offset = 0):

KV cache incompatibility with absolute embeddings from `autoregressive_wrapper.py:248`:

assert not (cache_kv and max_len_exceeded and not self.net.can_cache_kv_outside_max_seq_len), 'the network cannot use cached key values when decoding outside the max sequence length. most likely because you are using absolute positional embedding. you can switch to rotary embeddings to resolve this issue'
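
The guard above can be paraphrased as a small predicate. `can_cache_decode` and its arguments are hypothetical names for illustration, not the library's API:

```python
def can_cache_decode(uses_absolute_pos_emb, seq_len, max_seq_len, cache_kv=True):
    # absolute learned embeddings have one trained row per position up to
    # max_seq_len, so there is nothing to look up once decoding passes that
    # length; rotary embeddings compute angles on the fly and never run out
    max_len_exceeded = seq_len > max_seq_len
    return not (cache_kv and max_len_exceeded and uses_absolute_pos_emb)
```

Within the trained length, caching works either way; past it, only position schemes without a lookup table (such as rotary) keep decoding with a KV cache.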

Partial rotary embeddings (GPT-J approach) from `x_transformers.py:779`:

# partial rotary embeddings, Wang et al. GPT-J
t, t_unrotated = t[..., :rot_dim], t[..., rot_dim:]
