Principle:Lucidrains X transformers Sequence to Sequence Data Preparation

Field	Value
Repo	x-transformers
Domains	Data_Engineering, NLP
Last Updated	2026-02-08 18:00 GMT

Overview

Data preparation pattern for creating paired source-target sequence datasets suitable for encoder-decoder transformer training.

Description

Encoder-decoder training consumes (source, target, source_mask) tuples. The source sequence is fed to the encoder and the target sequence to the decoder; AutoregressiveWrapper shifts the target internally, so no manual shifting is needed. Source masks are boolean tensors that mark the valid (non-padding) positions in the source sequence.

The key requirements of this pattern are:

Source sequence: A tensor of integer token IDs representing the input to the encoder.
Target sequence: A tensor of integer token IDs representing the expected decoder output. This often starts with a special prefix or start token.
Source mask: A boolean tensor with the same shape as the source, where True marks a valid token position and False marks padding.

The train_copy.py example in the x-transformers repository demonstrates this pattern with a cycle() generator that produces these tuples on the fly.
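A minimal sketch of such a generator for a toy copy task is shown here. It is written in the spirit of that example rather than copied from it: the constant values, the CPU-only tensors, and the target layout (prefix followed by the source repeated twice) are assumptions made for illustration.

```python
import torch

# Illustrative constants (assumed values, kept small for readability)
BATCH_SIZE = 4
NUM_TOKENS = 16 + 2   # IDs 0 and 1 are reserved; ID 1 serves as the prefix token here
ENC_SEQ_LEN = 8

def cycle():
    """Yield (src, tgt, src_mask) batches forever for a toy copy task."""
    while True:
        # One prefix token (ID 1) per sequence in the batch
        prefix = torch.ones((BATCH_SIZE, 1), dtype=torch.long)
        # Random source tokens drawn from the non-reserved IDs [2, NUM_TOKENS)
        src = torch.randint(2, NUM_TOKENS, (BATCH_SIZE, ENC_SEQ_LEN))
        # Target: prefix followed by two copies of the source
        tgt = torch.cat((prefix, src, src), dim=1)
        # Every source position is valid, since all sequences share one length
        src_mask = torch.ones(BATCH_SIZE, ENC_SEQ_LEN, dtype=torch.bool)
        yield src, tgt, src_mask

src, tgt, src_mask = next(cycle())
```

Because every synthetic sequence has the same length, the mask is all True; real datasets with variable-length inputs need a mask derived from the actual lengths.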

Usage

Use this pattern when preparing data for XTransformer (encoder-decoder) training. Provide source tokens, target tokens (with prefix), and a source padding mask. Specifically:

Tokenize both source and target text into integer ID sequences.
Prepend a start-of-sequence token to the target sequence.
Create a boolean mask for the source to handle variable-length inputs within a batch.
Yield (src, tgt, src_mask) tuples from a generator or DataLoader.
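The steps above can be sketched for variable-length inputs as follows. This assumes the text has already been tokenized into lists of integer IDs; the names PAD_ID, SOS_ID, and collate are hypothetical and not part of x-transformers, and the ID values are arbitrary choices.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token ID
SOS_ID = 1  # assumed start-of-sequence (prefix) token ID

def collate(pairs):
    """Turn a list of (src_ids, tgt_ids) pairs into padded (src, tgt, src_mask) tensors."""
    srcs = [torch.tensor(s, dtype=torch.long) for s, _ in pairs]
    # Prepend the start-of-sequence token to every target
    tgts = [torch.tensor([SOS_ID] + t, dtype=torch.long) for _, t in pairs]
    # Right-pad each group to the longest sequence in the batch
    src = pad_sequence(srcs, batch_first=True, padding_value=PAD_ID)
    tgt = pad_sequence(tgts, batch_first=True, padding_value=PAD_ID)
    # Build the mask from the true lengths: True for real tokens, False for padding
    lengths = torch.tensor([len(s) for s in srcs])
    src_mask = torch.arange(src.shape[1])[None, :] < lengths[:, None]
    return src, tgt, src_mask

pairs = [([5, 6, 7], [8, 9]), ([4, 3], [2, 2, 2])]
src, tgt, src_mask = collate(pairs)
```

Deriving the mask from sequence lengths rather than from a comparison with PAD_ID keeps it correct even if the padding ID also appears as a real token. A function of this shape could be passed as collate_fn to a PyTorch DataLoader. How padded target positions are excluded from the loss is outside the scope of this sketch.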

Theoretical Basis

In sequence-to-sequence models, each training sample is a pair (x, y), where x is the source sequence and y is the target sequence. At each position the decoder receives the target tokens up to that position and predicts the next one, so its inputs are y without the last token and its labels are y without the first token. AutoregressiveWrapper performs this shift internally.
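For illustration only, the shift amounts to two slices of the same target tensor. The wrapper does this for you, so the snippet shows what happens to the target rather than something the data pipeline has to do; the token IDs are made up.

```python
import torch

# One target sequence: prefix token 1 followed by three content tokens
tgt = torch.tensor([[1, 7, 8, 9]])

decoder_input = tgt[:, :-1]  # what the decoder sees:     [1, 7, 8]
labels = tgt[:, 1:]          # what it should predict:    [7, 8, 9]
```

Position i of decoder_input is paired with position i of labels, which is the next token in the original sequence.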

The source mask ensures the encoder ignores padding tokens. This is critical for batched training where source sequences may have different lengths. Without proper masking, the encoder would attend to meaningless padding positions, degrading model quality.
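The effect of the mask on attention can be shown with a generic key-padding example. This is a simplified illustration of the general technique, not the attention code used inside x-transformers: the scores are made up, and masked positions are set to negative infinity before the softmax.

```python
import torch

# One query attending over three source positions with equal raw scores
scores = torch.zeros(1, 3)
# The third source position is padding
src_mask = torch.tensor([[True, True, False]])

# Without masking, the padding position receives a full share of attention
unmasked_weights = scores.softmax(dim=-1)

# With masking, padding scores become -inf and therefore get zero weight
masked_scores = scores.masked_fill(~src_mask, float("-inf"))
masked_weights = masked_scores.softmax(dim=-1)
```

In the unmasked case a third of the attention weight lands on padding; in the masked case the weight is split between the two real tokens only.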

The target sequence typically begins with a prefix token (e.g., a start-of-sequence token with ID 1) that signals the beginning of generation. During inference, this prefix token is provided as the initial input to the decoder for autoregressive generation.

Related Pages

Implementation:Lucidrains_X_transformers_Paired_Sequence_Generator_Pattern
