Heuristic:Lucidrains X transformers Rotary Position Embedding Selection
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLMs |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Guide for selecting and configuring positional embeddings in x-transformers, with rotary embeddings as the strongly recommended default.
Description
x-transformers supports multiple positional embedding strategies: rotary (RoPE), ALiBi, absolute learned embeddings, dynamic position bias, T5 relative bias, CoPE, and polar embeddings. The README and code strongly recommend rotary embeddings for most use cases. This heuristic captures the selection criteria and configuration details.
Usage
Use this heuristic when configuring a new model and choosing between positional embedding strategies, or when experiencing issues with sequence length generalization or training stability.
The Insight (Rule of Thumb)
- Action: Set `rotary_pos_emb=True` on `Decoder`/`Encoder` attention layers for ordered sequences (see the configuration sketch after this list).
- Value: Boolean flag, with optional dimension tuning via `rotary_emb_dim`.
- Trade-off: effectively none for ordered data. Rotary embeddings are compatible with flash attention, KV caching, and sequence-length extrapolation, making them the most versatile choice.
- Minimum dimension: `rotary_emb_dim` should be >= 32 for language models; the code logs a warning below that.
- NTK-aware scaling: for sequence-length extrapolation without fine-tuning, set `rotary_base_rescale_factor` > 1.0.
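A minimal configuration sketch using the flag names referenced on this page (`rotary_pos_emb`, `rotary_emb_dim`, `rotary_base_rescale_factor`); exact defaults and keyword availability may vary by x-transformers version:

```python
from x_transformers import TransformerWrapper, Decoder

# Decoder-only language model with rotary position embeddings enabled
model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True,       # use RoPE instead of absolute learned positions
        rotary_emb_dim = 32,         # keep >= 32 for language models (see warning below)
        # rotary_base_rescale_factor = 2.0,  # NTK-aware rescaling for length extrapolation
    )
)
```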
Comparison Table
| Method | Flash Compatible | Length Extrapolation | KV Cache Compatible | Notes |
|---|---|---|---|---|
| Rotary (RoPE) | Yes | Yes (with NTK scaling) | Yes | Recommended default |
| ALiBi | Yes | Limited (strong local bias) | Yes | May hinder attention beyond ~1k tokens |
| Absolute Learned | Yes | No | No (outside max_seq_len) | Legacy; the PaLM-era trend is to forgo them |
| T5 Relative Bias | No | Yes | Yes | Incompatible with flash attention |
| Dynamic Position Bias | No | Yes | Yes | Incompatible with flash attention |
| CoPE | No | Yes | Yes | Incompatible with flash attention |
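For comparison, the other strategies in the table are selected through their own attention-layer flags. The sketch below follows the flag names used in the x-transformers README (`alibi_pos_bias`, `rel_pos_bias`, `dynamic_pos_bias`); check them against the installed version before relying on them:

```python
from x_transformers import Decoder

# ALiBi: flash-compatible, but biases attention strongly toward nearby tokens
alibi = Decoder(dim = 512, depth = 6, heads = 8, alibi_pos_bias = True)

# T5 relative position bias: extrapolates, but not flash-attention compatible
t5_bias = Decoder(dim = 512, depth = 6, heads = 8, rel_pos_bias = True)

# Dynamic position bias: extrapolates, but not flash-attention compatible
dyn_bias = Decoder(dim = 512, depth = 6, heads = 8, dynamic_pos_bias = True)
```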
Reasoning
Rotary embeddings encode position through rotation of query/key vectors, preserving relative position information in the dot product without requiring an additive bias matrix. This means they work within the fused flash attention kernel (no attention matrix manipulation needed). The NTK-aware scaling trick (discovered by Reddit user bloc97) allows extending to longer sequences without retraining by rescaling the RoPE base frequency.
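A toy sketch of that property in plain PyTorch (not the library's implementation): rotating query/key feature pairs by position-dependent angles leaves the dot product a function of the relative offset only.

```python
import torch

def rotate(x, pos, inv_freq):
    # rotate each (even, odd) feature pair of x by angle pos * inv_freq
    x1, x2 = x[0::2], x[1::2]
    angle = pos * inv_freq
    cos, sin = angle.cos(), angle.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

dim = 8
inv_freq = 10000.0 ** (-torch.arange(0, dim, 2).float() / dim)  # per-pair frequencies
q, k = torch.randn(dim), torch.randn(dim)

# same relative offset (7 - 3 == 12 - 8) -> identical attention score, no bias matrix needed
a = rotate(q, 7.0, inv_freq) @ rotate(k, 3.0, inv_freq)
b = rotate(q, 12.0, inv_freq) @ rotate(k, 8.0, inv_freq)
assert torch.allclose(a, b, atol = 1e-5)
```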
ALiBi provides a strong local attention bias, but reports suggest it may hinder attention at distances greater than about 1k tokens. Absolute positional embeddings cannot decode with cached key/values beyond the trained `max_seq_len`, making them impractical for variable-length inference.
Code Evidence
Minimum dimension warning from `x_transformers.py:2355-2356`:

```python
if verbose and rotary_emb_dim < 32:
    logger.warning('when training language model, rotary embedding dimension should be at least 32')
```
NTK-Aware base rescaling from `x_transformers.py:713-716`:

```python
# proposed by reddit user bloc97, to rescale rotary embeddings
# to longer sequence length without fine-tuning
# has some connection to NTK literature
base *= base_rescale_factor ** (dim / (dim - 2))
```
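A small worked sketch of the effect (illustrative, not the library code): rescaling the base stretches the low-frequency rotations, so with `dim = 64` a `base_rescale_factor` of 8 multiplies the base by roughly 8^(64/62) ≈ 8.6.

```python
import torch

def rope_inv_freq(dim, base = 10000.0, base_rescale_factor = 1.0):
    # NTK-aware trick: grow the base so the slowest rotations cover longer contexts
    base = base * base_rescale_factor ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

print(rope_inv_freq(64)[-1])                               # slowest frequency at the default base
print(rope_inv_freq(64, base_rescale_factor = 8.0)[-1])    # noticeably slower after rescaling
```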
Autocast disabled for positional precision from `x_transformers.py:739`:

```python
@autocast('cuda', enabled = False)
def forward(self, t, offset = 0):
```
KV cache incompatibility with absolute embeddings from `autoregressive_wrapper.py:248`:

```python
assert not (cache_kv and max_len_exceeded and not self.net.can_cache_kv_outside_max_seq_len), 'the network cannot use cached key values when decoding outside the max sequence length. most likely because you are using absolute positional embedding. you can switch to rotary embeddings to resolve this issue'
```
Partial rotary embeddings (GPT-J approach) from `x_transformers.py:779`:

```python
# partial rotary embeddings, Wang et al. GPT-J
t, t_unrotated = t[..., :rot_dim], t[..., rot_dim:]
```
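A standalone sketch of that split (illustrative only, not the library's rotation layout): rotate the first `rot_dim` features and pass the remainder through untouched.

```python
import torch

def partial_rotary(t, pos, inv_freq):
    # GPT-J-style partial rotation: only the first 2 * len(inv_freq) features are rotated
    rot_dim = 2 * inv_freq.numel()
    t_rot, t_pass = t[..., :rot_dim], t[..., rot_dim:]
    x1, x2 = t_rot[..., 0::2], t_rot[..., 1::2]
    angle = pos * inv_freq
    cos, sin = angle.cos(), angle.sin()
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim = -1)
    return torch.cat([rotated, t_pass], dim = -1)

x = torch.randn(64)
inv_freq = 10000.0 ** (-torch.arange(0, 32, 2).float() / 32)  # rotate only the first 32 dims
out = partial_rotary(x, pos = 5.0, inv_freq = inv_freq)
assert out.shape == x.shape
```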
Related Pages
- Implementation:Lucidrains_X_transformers_TransformerWrapper_Decoder_Init
- Implementation:Lucidrains_X_transformers_TransformerWrapper_Encoder_Init
- Implementation:Lucidrains_X_transformers_XTransformer_Init
- Principle:Lucidrains_X_transformers_Causal_Decoder_Configuration
- Principle:Lucidrains_X_transformers_Bidirectional_Encoder_Configuration
- Principle:Lucidrains_X_transformers_Encoder_Decoder_Configuration