Principle: Lucidrains x-transformers Encoder-Decoder Configuration
Metadata
| Field | Value |
|---|---|
| Sources | Paper: Attention Is All You Need; Repo: x-transformers |
| Domains | Deep_Learning, NLP, Model_Architecture |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Architecture configuration pattern for defining encoder-decoder transformer models that map input sequences to output sequences using cross-attention.
Description
The encoder-decoder architecture consists of two stacks: an encoder that processes the input sequence bidirectionally, and a decoder that generates the output sequence autoregressively while attending to encoder outputs via cross-attention.
In the x-transformers library, XTransformer is a convenience class that bundles three components into a single module:
- Encoder TransformerWrapper -- Wraps an Encoder (an AttentionLayers subclass with causal=False) that processes the input sequence using bidirectional self-attention. Configured with return_only_embed=True so that it outputs hidden states rather than logits.
- Decoder TransformerWrapper -- Wraps a Decoder (an AttentionLayers subclass with causal=True and cross_attend=True) that generates the output sequence autoregressively while attending to the encoder hidden states.
- AutoregressiveWrapper -- Wraps the decoder TransformerWrapper to provide automatic input/target splitting and loss computation during training, as well as autoregressive generation at inference time.
XTransformer provides a single interface for sequence-to-sequence tasks. Configuration uses enc_ and dec_ prefixed parameters to separately configure each side. For example, enc_depth=6 sets the encoder to 6 layers, while dec_depth=6 sets the decoder to 6 layers. The shared dim parameter sets the model dimension for both encoder and decoder.
Usage
Use this principle when building sequence-to-sequence models. The encoder-decoder architecture is appropriate for tasks where the input and output are different sequences, unlike decoder-only models which process a single sequence.
When to apply
- Machine translation -- Mapping a source language sentence to a target language sentence.
- Summarization -- Compressing a long input document into a shorter output summary.
- Copy tasks -- Learning to copy or transduce input sequences to output sequences (useful for testing and debugging).
- Sequence transduction -- Any task where the input sequence is fully observed before generation begins.
When not to apply
- You are building a decoder-only autoregressive language model (use Decoder with TransformerWrapper and AutoregressiveWrapper directly).
- You are building an encoder-only model for classification or representation learning (use Encoder with TransformerWrapper).
- The input and output are the same sequence (decoder-only models are typically more appropriate).
Theoretical Basis
Encoder-Decoder Attention
The encoder-decoder architecture, as introduced by Vaswani et al. (2017), connects the encoder and decoder through cross-attention (also called encoder-decoder attention). In each decoder layer, after the causal self-attention sub-layer, a cross-attention sub-layer allows the decoder to attend to the encoder hidden states:
CrossAttention(Q, K, V):
Q = decoder hidden states (from causal self-attention output)
K = encoder hidden states
V = encoder hidden states
Output = softmax(Q K^T / sqrt(d_k)) V
This mechanism allows each decoder position i to attend to all encoder positions, enabling the decoder to selectively focus on the relevant parts of the input when generating each output token.
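The formula above can be sketched in plain NumPy (single head, with the learned Q/K/V projection matrices omitted for clarity):

```python
import numpy as np

def softmax(x, axis = -1):
    x = x - x.max(axis = axis, keepdims = True)
    e = np.exp(x)
    return e / e.sum(axis = axis, keepdims = True)

def cross_attention(dec_hidden, enc_hidden, d_k):
    # Q comes from the decoder, K and V from the encoder.
    Q, K, V = dec_hidden, enc_hidden, enc_hidden
    scores = Q @ K.T / np.sqrt(d_k)       # (dec_len, enc_len)
    weights = softmax(scores, axis = -1)  # each decoder position distributes
                                          # attention over all encoder positions
    return weights @ V                    # (dec_len, d_k)

dec_hidden = np.random.randn(5, 8)  # 5 decoder positions, d_k = 8
enc_hidden = np.random.randn(7, 8)  # 7 encoder positions
out = cross_attention(dec_hidden, enc_hidden, 8)
```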
Encoder: Bidirectional Self-Attention
The encoder uses non-causal (bidirectional) self-attention. Every position in the input sequence can attend to every other position, allowing the encoder to build rich contextual representations that incorporate information from the entire input:
Encoder Self-Attention: position i attends to all positions j in [0, ..., n-1]
No causal mask applied.
In x-transformers, the Encoder class is a subclass of AttentionLayers that sets causal=False.
Decoder: Causal Self-Attention + Cross-Attention
The decoder uses causal (masked) self-attention for its self-attention sub-layers, ensuring that each output position can only depend on previously generated tokens. In addition, each decoder layer includes a cross-attention sub-layer that attends to the encoder outputs:
Decoder layer:
1. Causal self-attention: position i attends to positions j where j <= i
2. Cross-attention: position i attends to ALL encoder positions
3. Feedforward network
In x-transformers, the Decoder class sets causal=True, and passing cross_attend=True adds cross-attention sub-layers that receive the encoder hidden states as context.
XTransformer Assembly
The XTransformer class handles the assembly by:
- Creating an Encoder (non-causal) and wrapping it in a TransformerWrapper with return_only_embed=True.
- Creating a Decoder (causal, cross_attend=True) and wrapping it in a TransformerWrapper, then further wrapping it in an AutoregressiveWrapper.
- During the forward pass, encoder hidden states are passed to the decoder as context for cross-attention.