Principle:Lucidrains X transformers Autoregressive Data Preparation

From Leeroopedia


Field          Value
Repo           x-transformers
Domains        Data_Engineering, NLP
Last Updated   2026-02-08 18:00 GMT

Overview

Data preparation pattern for creating fixed-length token sequence datasets suitable for autoregressive language model training.

Description

Autoregressive training requires datasets that yield token sequences of length seq_len + 1: the extra token supplies the target for the last input position. The dataset samples random contiguous subsequences from a larger tokenized corpus, and a cycling DataLoader restarts iteration whenever the underlying loader is exhausted, so training can draw batches indefinitely. The x-transformers train_enwik8.py example demonstrates this with a TextSamplerDataset class.

The key characteristics of this data preparation pattern are:

  • Each sample is a contiguous subsequence drawn from the tokenized corpus.
  • The length of each sample is exactly seq_len + 1 tokens.
  • A random starting position is chosen for each sample, providing diversity across epochs.
  • The dataset is wrapped in a cycling DataLoader that restarts whenever it runs out, so the stream of batches never ends.
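
The characteristics above can be sketched in PyTorch roughly as follows. This is a minimal sketch modeled on the TextSamplerDataset idea, not a copy of train_enwik8.py: the cycle helper, the arange stand-in corpus, the upper bound passed to torch.randint, and the batch size are all illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TextSamplerDataset(Dataset):
    """Returns random contiguous windows of seq_len + 1 tokens from a 1D tensor."""

    def __init__(self, data, seq_len):
        super().__init__()
        self.data = data          # 1D tensor of token IDs
        self.seq_len = seq_len

    def __len__(self):
        # Nominal epoch size. Samples are random, so this only sets how many
        # draws make up one pass of the DataLoader.
        return self.data.size(0) // self.seq_len

    def __getitem__(self, index):
        # The index is ignored: every call draws a fresh random start.
        # torch.randint excludes its upper bound, so the last valid start
        # (len - seq_len - 1) is still reachable.
        max_start = self.data.size(0) - self.seq_len
        start = torch.randint(0, max_start, (1,)).item()
        return self.data[start : start + self.seq_len + 1].long()


def cycle(loader):
    """Restart the DataLoader whenever it runs out, giving an endless stream."""
    while True:
        for batch in loader:
            yield batch


corpus = torch.arange(1000)               # stand-in for a tokenized corpus
dataset = TextSamplerDataset(corpus, seq_len=16)
loader = cycle(DataLoader(dataset, batch_size=4))

sample = dataset[0]                       # one window of seq_len + 1 tokens
batch = next(loader)                      # a stacked batch of such windows
```

Because the dataset ignores the index it is given, shuffling in the DataLoader is unnecessary here; randomness comes from the sampling itself.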

Usage

Use this pattern when preparing data for autoregressive training with AutoregressiveWrapper. Data must be tokenized to integer IDs and sequences must include one extra token for the target. Specifically:

  • Tokenize your raw text corpus into a 1D tensor of integer token IDs.
  • Split the tokenized data into training and validation sets.
  • Create a Dataset subclass that returns sequences of shape (seq_len + 1,).
  • Wrap the dataset in a cycling DataLoader for infinite iteration.
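
The first two steps might look like the sketch below. It assumes byte-level tokenization (each UTF-8 byte becomes one token ID) and a 90/10 split; both are illustrative choices, and the text variable is a made-up stand-in corpus.

```python
import torch

text = "the quick brown fox jumps over the lazy dog. " * 20   # stand-in corpus

# Byte-level "tokenization": each UTF-8 byte becomes an integer ID in [0, 255].
tokens = torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

# Hold out the final 10% for validation (the ratio is an arbitrary choice).
# Integer arithmetic avoids floating-point rounding at the boundary.
split = tokens.size(0) * 9 // 10
train_tokens, valid_tokens = tokens[:split], tokens[split:]
```

Splitting by position, rather than shuffling individual tokens, keeps each split contiguous so that windows sampled from it are still coherent text.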

Theoretical Basis

For teacher forcing, the input and target are derived from the same sequence:

  • Input = tokens[:-1] (first seq_len tokens)
  • Target = tokens[1:] (last seq_len tokens, shifted by one)

Therefore each training sample must be seq_len + 1 tokens long so that both input and target are exactly seq_len tokens.
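
A small worked example of the shift, using plain Python lists and made-up token IDs (the values and seq_len = 4 are purely illustrative):

```python
seq_len = 4
tokens = [10, 11, 12, 13, 14]     # seq_len + 1 made-up token IDs

inputs = tokens[:-1]              # [10, 11, 12, 13]
targets = tokens[1:]              # [11, 12, 13, 14]

# At position i the model sees inputs[: i + 1] and is trained to predict targets[i].
pairs = [(inputs[: i + 1], targets[i]) for i in range(seq_len)]
```

The last pair is ([10, 11, 12, 13], 14): without the extra token, the final input position would have no target.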

Random sampling from a large corpus acts as a light form of data augmentation. Because the starting position is drawn uniformly at random, the same text appears in different context windows across training iterations, which exposes the model to more diverse contexts and reduces overfitting to fixed sequence boundaries.
