Principle: Lucidrains x-transformers Encoder-Decoder Configuration
Metadata
| Field | Value |
|---|---|
| Sources | Paper: Attention Is All You Need; Repo: x-transformers |
| Domains | Deep_Learning, NLP, Model_Architecture |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Architecture configuration pattern for defining encoder-decoder transformer models that map input sequences to output sequences using cross-attention.
Description
The encoder-decoder architecture consists of two stacks: an encoder that processes the input sequence bidirectionally, and a decoder that generates the output sequence autoregressively while attending to encoder outputs via cross-attention.
In the x-transformers library, XTransformer is a convenience class that bundles three components into a single module:
- Encoder TransformerWrapper -- Wraps an Encoder (an AttentionLayers subclass with causal=False) that processes the input sequence using bidirectional self-attention. Configured with return_only_embed=True so that it outputs hidden states rather than logits.
- Decoder TransformerWrapper -- Wraps a Decoder (an AttentionLayers subclass with causal=True and cross_attend=True) that generates the output sequence autoregressively while attending to the encoder hidden states.
- AutoregressiveWrapper -- Wraps the decoder TransformerWrapper to provide automatic input/target splitting and loss computation during training, as well as autoregressive generation at inference time.
XTransformer provides a single interface for sequence-to-sequence tasks. Configuration uses enc_ and dec_ prefixed parameters to separately configure each side. For example, enc_depth=6 sets the encoder to 6 layers, while dec_depth=6 sets the decoder to 6 layers. The shared dim parameter sets the model dimension for both encoder and decoder.
Usage
Use this principle when building sequence-to-sequence models. The encoder-decoder architecture is appropriate for tasks where the input and output are different sequences, unlike decoder-only models which process a single sequence.
When to apply
- Machine translation -- Mapping a source language sentence to a target language sentence.
- Summarization -- Compressing a long input document into a shorter output summary.
- Copy tasks -- Learning to copy or transduce input sequences to output sequences (useful for testing and debugging).
- Sequence transduction -- Any task where the input sequence is fully observed before generation begins.
When not to apply
- You are building a decoder-only autoregressive language model (use Decoder with TransformerWrapper and AutoregressiveWrapper directly).
- You are building an encoder-only model for classification or representation learning (use Encoder with TransformerWrapper).
- The input and output are the same sequence (decoder-only models are typically more appropriate).
Theoretical Basis
Encoder-Decoder Attention
The encoder-decoder architecture, as introduced by Vaswani et al. (2017), connects the encoder and decoder through cross-attention (also called encoder-decoder attention). In each decoder layer, after the causal self-attention sub-layer, a cross-attention sub-layer allows the decoder to attend to the encoder hidden states:
CrossAttention(Q, K, V):
Q = decoder hidden states (from causal self-attention output)
K = encoder hidden states
V = encoder hidden states
Output = softmax(Q K^T / sqrt(d_k)) V
This mechanism allows each decoder position i to attend to all encoder positions, enabling the decoder to selectively focus on the relevant parts of the input when generating each output token.
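The formula above can be sketched in plain NumPy (single head, with the learned Q/K/V projection matrices omitted for clarity):

```python
import numpy as np

def softmax(x, axis = -1):
    x = x - x.max(axis = axis, keepdims = True)
    e = np.exp(x)
    return e / e.sum(axis = axis, keepdims = True)

def cross_attention(dec_hidden, enc_hidden, d_k):
    # Q comes from the decoder, K and V from the encoder.
    Q, K, V = dec_hidden, enc_hidden, enc_hidden
    scores = Q @ K.T / np.sqrt(d_k)       # (dec_len, enc_len)
    weights = softmax(scores, axis = -1)  # each decoder position distributes
                                          # attention over all encoder positions
    return weights @ V                    # (dec_len, d_k)

dec_hidden = np.random.randn(5, 8)  # 5 decoder positions, d_k = 8
enc_hidden = np.random.randn(7, 8)  # 7 encoder positions
out = cross_attention(dec_hidden, enc_hidden, 8)
```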
Encoder: Bidirectional Self-Attention
The encoder uses non-causal (bidirectional) self-attention. Every position in the input sequence can attend to every other position, allowing the encoder to build rich contextual representations that incorporate information from the entire input:
Encoder Self-Attention: position i attends to all positions j in [0, ..., n-1]
No causal mask applied.
In x-transformers, the Encoder class is a subclass of AttentionLayers that sets causal=False.
Decoder: Causal Self-Attention + Cross-Attention
The decoder uses causal (masked) self-attention for its self-attention sub-layers, ensuring that each output position can only depend on previously generated tokens. In addition, each decoder layer includes a cross-attention sub-layer that attends to the encoder outputs:
Decoder layer:
1. Causal self-attention: position i attends to positions j where j <= i
2. Cross-attention: position i attends to ALL encoder positions
3. Feedforward network
In x-transformers, the Decoder class sets causal=True, and passing cross_attend=True adds cross-attention sub-layers that receive the encoder hidden states as context.
XTransformer Assembly
The XTransformer class handles the assembly by:
- Creating an Encoder (non-causal) and wrapping it in a TransformerWrapper with return_only_embed=True.
- Creating a Decoder (causal, cross_attend=True) and wrapping it in a TransformerWrapper, then further wrapping it in an AutoregressiveWrapper.
- During the forward pass, encoder hidden states are passed to the decoder as context for cross-attention.