Principle:Lucidrains X transformers Autoregressive Data Preparation
| Field | Value |
|---|---|
| Repo | x-transformers |
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
A data preparation pattern that produces fixed-length token sequence datasets for autoregressive language model training.
Description
Autoregressive training requires datasets that yield token sequences of length seq_len + 1 (the extra token provides the target for the last input position). The dataset samples random contiguous subsequences from a larger corpus. A cycling DataLoader provides infinite batches. The x-transformers train_enwik8.py example demonstrates this with a TextSamplerDataset class.
The key characteristics of this data preparation pattern are:
- Each sample is a contiguous subsequence drawn from the tokenized corpus.
- The length of each sample is exactly seq_len + 1 tokens.
- A random starting position is chosen for each sample, providing diversity across epochs.
- The dataset is wrapped in a cycling DataLoader that yields infinite batches without exhausting the iterator.
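The cycling behaviour in the last point can be sketched with a small generator. This is a minimal illustration rather than the repository's exact code: the helper name `cycle` and the stand-in list `fake_loader` are assumptions made for the example, and any re-iterable object (such as a PyTorch DataLoader) could take the list's place.

```python
from itertools import islice


def cycle(loader):
    """Yield items from `loader` forever, restarting it each time it runs out."""
    while True:
        for batch in loader:
            yield batch


# Hypothetical stand-in for a DataLoader: three "batches" in a plain list.
fake_loader = [[0, 1], [2, 3], [4, 5]]

# Pull seven batches from a source that only holds three.
first_seven = list(islice(cycle(fake_loader), 7))
print(first_seven)
```

Note that `itertools.cycle` is not a drop-in replacement here: it caches the items from the first pass and replays them, so a randomly sampling dataset would keep returning the same batches. Re-iterating the loader, as above, asks the dataset for fresh samples on every pass.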
Usage
Use this pattern when preparing data for autoregressive training with AutoregressiveWrapper. Data must be tokenized to integer IDs and sequences must include one extra token for the target. Specifically:
- Tokenize your raw text corpus into a 1D tensor of integer token IDs.
- Split the tokenized data into training and validation sets.
- Create a Dataset subclass that returns sequences of shape (seq_len + 1,).
- Wrap the dataset in a cycling DataLoader for infinite iteration.
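The four steps above might look roughly like the following sketch. It is modeled on the pattern described here, not copied from train_enwik8.py. The byte-level tokenization, the toy corpus, the 90/10 split, and the values of `SEQ_LEN` and `BATCH_SIZE` are all assumptions chosen to keep the example small and self-contained.

```python
import torch
from torch.utils.data import DataLoader, Dataset

SEQ_LEN = 16     # hypothetical value for illustration
BATCH_SIZE = 4   # hypothetical value for illustration


def cycle(loader):
    """Yield batches forever by re-iterating the loader."""
    while True:
        for batch in loader:
            yield batch


class TextSamplerDataset(Dataset):
    """Returns random contiguous windows of seq_len + 1 token IDs."""

    def __init__(self, data, seq_len):
        super().__init__()
        self.data = data          # 1D tensor of token IDs
        self.seq_len = seq_len

    def __getitem__(self, index):
        # `index` is ignored: every call draws a fresh random start.
        # randint's upper bound is exclusive, so the largest start still
        # leaves room for a full window of seq_len + 1 tokens.
        high = self.data.size(0) - self.seq_len
        start = torch.randint(0, high, (1,)).item()
        return self.data[start : start + self.seq_len + 1].long()

    def __len__(self):
        # Nominal size only; it sets how many draws make up one pass.
        return self.data.size(0) // self.seq_len


# Step 1: tokenize raw text into a 1D tensor of integer IDs (bytes here).
text = "the quick brown fox jumps over the lazy dog. " * 50
tokens = torch.tensor(list(text.encode("utf-8")), dtype=torch.uint8)

# Step 2: split into training and validation portions.
n_train = int(0.9 * tokens.size(0))
train_data, val_data = tokens[:n_train], tokens[n_train:]

# Steps 3 and 4: build the datasets and wrap the loaders so they never run out.
train_loader = cycle(DataLoader(TextSamplerDataset(train_data, SEQ_LEN), batch_size=BATCH_SIZE))
val_loader = cycle(DataLoader(TextSamplerDataset(val_data, SEQ_LEN), batch_size=BATCH_SIZE))

batch = next(train_loader)
print(batch.shape, batch.dtype)
```

Each batch has shape (BATCH_SIZE, SEQ_LEN + 1) and integer dtype, which is the form the pattern described here expects before the sequence is split into input and target.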
Theoretical Basis
For teacher forcing, the input and target are derived from the same sequence:
- Input = tokens[:-1] (first seq_len tokens)
- Target = tokens[1:] (last seq_len tokens, shifted by one)
Therefore each training sample must be seq_len + 1 tokens long so that both input and target are exactly seq_len tokens.
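A small worked example makes the length arithmetic concrete. The helper name `split_input_target` and the token values are hypothetical, chosen only for illustration. Plain Python lists are used so the slicing is easy to inspect.

```python
def split_input_target(tokens):
    """Split seq_len + 1 tokens into (input, target), each of length seq_len."""
    return tokens[:-1], tokens[1:]


seq_len = 4
sample = [10, 11, 12, 13, 14]   # seq_len + 1 tokens

inp, target = split_input_target(sample)
print(inp)      # [10, 11, 12, 13]
print(target)   # [11, 12, 13, 14]

# At every position i, target[i] is the token that follows inp[i] in `sample`.
```

For a batched tensor of shape (batch, seq_len + 1), the same split would be taken along the last dimension, for example batch[:, :-1] and batch[:, 1:].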
Random sampling from a large corpus acts as a light form of data augmentation. Because the starting position is drawn uniformly at random, the model sees different context windows across training iterations instead of the same fixed chunk boundaries, which reduces overfitting to those boundaries.