Principle: Lucidrains x-transformers Causal Decoder Configuration
Metadata
| Field | Value |
|---|---|
| Sources | Paper: Attention Is All You Need; Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding; Repo: x-transformers |
| Domains | Deep_Learning, NLP, Model_Architecture |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Architecture configuration pattern for defining causal decoder-only transformer models with customizable attention layers, positional encodings, and normalization strategies.
Description
Configuring a causal decoder-only transformer involves selecting a set of core hyperparameters that together define the model architecture before any training begins. The primary parameters are:
- `dim` -- The model dimension (hidden size). Every token embedding, attention projection, and feedforward layer operates at this dimensionality. Larger values increase model capacity but also memory and compute cost.
- `depth` -- The number of transformer layers stacked sequentially. Each layer consists of a masked self-attention sub-layer followed by a position-wise feedforward sub-layer. Deeper models can learn more complex representations but are harder to train.
- `heads` -- The number of parallel attention heads. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. The per-head dimension is typically `dim / heads`.
- Positional encoding strategy -- Because self-attention is permutation-equivariant, transformers require an explicit mechanism to encode token order. Common strategies include:
- Absolute sinusoidal position embeddings -- The original approach from Vaswani et al. (2017). Fixed sinusoidal functions of different frequencies provide a unique encoding per position.
- Learned absolute position embeddings -- A learnable embedding table indexed by position. Simple but limited to the maximum training sequence length.
- Rotary Position Embeddings (RoPE) -- Introduced in the RoFormer paper. Rotary embeddings encode relative position by rotating query and key vectors in pairs of dimensions, enabling better length generalization beyond training context.
- ALiBi (Attention with Linear Biases) -- Adds a fixed linear bias to attention scores based on the distance between query and key positions. Provides strong extrapolation to unseen sequence lengths without any learned positional parameters.
In the x-transformers library, the architecture is composed from two principal classes:
- `AttentionLayers` -- The core class that constructs the stack of attention and feedforward layers. It accepts `dim`, `depth`, `heads`, normalization options (`use_rmsnorm`, `pre_norm`, `sandwich_norm`), positional encoding flags (`rotary_pos_emb`, `alibi_pos_bias`), and dozens of other configuration knobs for advanced features such as residual gating, layer dropout, and hyper-connections.
- `TransformerWrapper` -- A wrapper that takes an `AttentionLayers` instance and adds token embeddings, positional embeddings, and an output projection head (logits layer). It is the outermost module that accepts token IDs as input and produces logits as output.
The `Decoder` class is a thin convenience subclass of `AttentionLayers` that forces `causal=True`. This is the canonical way to build an autoregressive (GPT-style) language model: instantiate a `Decoder` with the desired hyperparameters and pass it as the `attn_layers` argument to `TransformerWrapper`.
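The composition described above can be sketched as follows. This mirrors the basic usage shown in the x-transformers README; `num_tokens`, `max_seq_len`, and `rotary_pos_emb` are real keyword arguments, but defaults and available flags may differ across library versions, so treat the exact values here as illustrative.

```python
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,        # vocabulary size
    max_seq_len = 1024,        # maximum sequence length
    attn_layers = Decoder(
        dim = 512,             # model dimension (hidden size)
        depth = 6,             # number of stacked transformer layers
        heads = 8,             # attention heads; per-head dim = 512 / 8 = 64
        rotary_pos_emb = True  # use RoPE instead of absolute position embeddings
    )
)

tokens = torch.randint(0, 20000, (1, 1024))  # batch of token IDs
logits = model(tokens)                       # shape: (1, 1024, 20000)
```

Swapping `rotary_pos_emb = True` for `alibi_pos_bias = True` selects ALiBi instead; leaving both off falls back to the wrapper's absolute position embeddings.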
This configuration is the foundational first step in building an autoregressive language model. Every subsequent concern -- training loop, optimizer selection, data loading, inference -- depends on the architecture choices made here.
Usage
Use this principle when building a GPT-style causal language model. The configuration determines:
- Model capacity -- Controlled primarily by `dim`, `depth`, and `heads`. Typical small models use `dim=512, depth=6, heads=8`; larger models scale to `dim=4096, depth=32, heads=32` and beyond.
- Attention mechanism -- Whether to use standard dot-product attention, multi-query attention, grouped-query attention, or other variants (configured via `attn_`-prefixed keyword arguments passed through to `AttentionLayers`).
- Positional encoding -- Choose rotary embeddings (`rotary_pos_emb=True`) for strong length generalization and compatibility with modern architectures (LLaMA, PaLM). Choose ALiBi (`alibi_pos_bias=True`) when extrapolation to very long sequences is critical and you want zero additional positional parameters.
- Normalization -- Pre-norm (default) vs. post-norm, RMSNorm vs. LayerNorm, sandwich norm, etc.
When to apply
- You are starting a new autoregressive language model project.
- You need to choose between positional encoding strategies for a specific length-generalization requirement.
- You are scaling a model up or down and need to understand how `dim`, `depth`, and `heads` interact.
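A back-of-the-envelope parameter estimate shows how `dim` and `depth` dominate model size (a rough sketch: it counts only the attention and feedforward projection weights, ignores embeddings, biases, and norms, and assumes the standard 4x feedforward expansion):

```python
def approx_layer_stack_params(dim: int, depth: int) -> int:
    """Rough parameter count for the transformer layer stack only.

    Per layer: 4*dim^2 for the Q/K/V/output attention projections,
    plus 8*dim^2 for a feedforward with 4x inner expansion
    (dim -> 4*dim -> dim). Note that `heads` only splits `dim` into
    parallel subspaces; it does not change the parameter count.
    """
    per_layer = 4 * dim * dim + 8 * dim * dim  # = 12 * dim^2
    return per_layer * depth

small = approx_layer_stack_params(dim=512, depth=6)    # ~18.9M parameters
large = approx_layer_stack_params(dim=4096, depth=32)  # ~6.4B parameters
```

Because the count grows quadratically in `dim` but only linearly in `depth`, widening a model is far more expensive per layer than deepening it.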
When not to apply
- You are building an encoder-only model (use `Encoder` instead of `Decoder`).
- You are building an encoder-decoder model (use both `Encoder` and `Decoder` with cross-attention).
- You need a non-autoregressive model (consider `NonAutoregressiveWrapper`).
Theoretical Basis
Transformer Decoder Architecture
The transformer decoder, as introduced by Vaswani et al. (2017), consists of a stack of N identical layers. Each layer has two sub-layers:
- Masked multi-head self-attention -- Allows each position to attend to all previous positions (and itself), but not to future positions.
- Position-wise feedforward network -- A two-layer MLP applied independently to each position:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2.
Residual connections and layer normalization are applied around each sub-layer.
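The position-wise feedforward formula above can be made concrete in a few lines of dependency-free Python (a minimal sketch; the toy weights and dimensions are illustrative, not trained values):

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2.

    x is the vector at a single position; the same weights are applied
    independently at every position in the sequence. Weight matrices are
    given as lists of columns for readability.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(W1, b1)]   # ReLU(x W_1 + b_1)
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(W2, b2)]     # hidden W_2 + b_2

# toy dimensions: dim = 2, inner (expanded) dim = 3
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]
out = ffn([2.0, -1.0], W1, b1, W2, b2)  # → [2.0, 0.0]
```

In practice the inner dimension is typically 4x `dim`, and modern variants replace ReLU with GELU or gated activations, but the per-position structure is the same.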
Causal Masking
The defining property of a causal (autoregressive) decoder is the causal mask. During self-attention, position i can only attend to positions j where j is less than or equal to i. This is enforced by setting the upper-triangular portion of the attention score matrix to negative infinity before the softmax:
Attention(Q, K, V) = softmax( (Q K^T / sqrt(d_k)) + M ) V
where M is the causal mask matrix:
M[i][j] = 0 if j <= i
M[i][j] = -inf if j > i
This ensures that the output at each position depends only on the known outputs at earlier positions, which is essential for autoregressive generation where tokens are produced one at a time from left to right.
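The mask definition above can be verified directly: after adding `-inf` to future positions and applying softmax, each row's attention weight on future positions is exactly zero (a small self-contained sketch; the score values are arbitrary illustrative numbers):

```python
import math

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.5, 1.0, 0.2],   # toy 3x3 attention scores (pre-mask)
          [0.1, 0.3, 0.9],
          [0.7, 0.4, 0.6]]
masked = [[s + m for s, m in zip(srow, mrow)]
          for srow, mrow in zip(scores, causal_mask(3))]
weights = [softmax(row) for row in masked]
# row 0 attends only to position 0; row 1 to positions 0-1; row 2 to all three
```

Note that position 0's row collapses to `[1.0, 0.0, 0.0]` regardless of its scores, since it has nothing earlier to attend to.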
Scaled Dot-Product Attention
The core attention computation is:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where:
- Q (queries), K (keys), V (values) are linear projections of the input.
- `d_k` is the dimension of the key vectors (typically `dim / heads`).
- The scaling factor `1 / sqrt(d_k)` prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients.
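The formula translates almost line-for-line into dependency-free Python (an unmasked sketch; matrices are lists of row vectors and the toy Q, K, V values are illustrative):

```python
import math

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row, without masking."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # scaled dot products of this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # output = convex combination of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]  # 2 positions, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)  # each output row mixes the rows of V
```

A causal version would simply add the mask `M` from the previous section to `scores` before the softmax.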
Multi-Head Attention
Rather than performing a single attention function, multi-head attention runs h parallel attention heads, each with its own learned projections:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
This allows different heads to focus on different types of relationships (e.g., syntactic vs. semantic, local vs. long-range).
Rotary Position Embeddings (RoPE)
Rotary embeddings, introduced by Su et al. (2021), are a modern alternative to absolute sinusoidal position embeddings. Rather than adding position information to the input, RoPE applies a rotation to the query and key vectors based on their absolute position. The key insight is that the dot product between a rotated query at position m and a rotated key at position n depends only on their relative position m - n:
<q_m, k_n> = Re[(q * e^{i*m*theta})^* (k * e^{i*n*theta})]
= Re[(q^* k) * e^{i*(n-m)*theta}]
This gives RoPE several advantages:
- Relative position awareness without explicit relative position bias terms.
- Better length generalization compared to learned absolute embeddings.
- No additional parameters -- the rotations are deterministic functions of position.
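The relative-position identity above can be checked numerically by treating a pair of query/key dimensions as a single complex number, so the rotation is just multiplication by `e^{i*pos*theta}` (a minimal sketch; `theta` and the query/key values are arbitrary illustrative choices, and real RoPE applies a spectrum of frequencies across dimension pairs):

```python
import cmath

def rotate(x: complex, pos: int, theta: float) -> complex:
    """Apply the rotary embedding for one dimension pair: x * e^{i*pos*theta}."""
    return x * cmath.exp(1j * pos * theta)

theta = 0.1                                   # illustrative frequency
q = complex(1.0, 2.0)                         # illustrative query pair
k = complex(0.5, -1.0)                        # illustrative key pair

def rope_score(m: int, n: int) -> float:
    """<q_m, k_n> = Re[(q e^{i m theta})^* (k e^{i n theta})]"""
    return (rotate(q, m, theta).conjugate() * rotate(k, n, theta)).real

# the score depends only on n - m: shifting both positions by 100
# leaves it unchanged (up to floating-point error)
print(abs(rope_score(2, 7) - rope_score(102, 107)) < 1e-9)  # True
```

This shift invariance is exactly what lets RoPE-trained models behave sensibly at positions beyond those seen in training: the attention score between two tokens depends on their separation, not on where the pair happens to sit in the sequence.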
RoPE has been adopted by most modern large language models including LLaMA, PaLM, and Gemma.