
Principle:Fastai Fastbook Language Model Data

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Language Modeling, Data Engineering
Last Updated 2026-02-09 17:00 GMT

Overview

Language model data preparation is the process of organizing numericalized token sequences into contiguous streams and then slicing them into overlapping input-target pairs suitable for training an autoregressive language model.

Description

A language model is trained to predict the next token given all preceding tokens. To create training data for this task, the standard approach involves:

  1. Concatenation: All documents in the corpus are concatenated into a single long stream of token indices, separated by xxbos tokens at document boundaries.
  2. Reshaping into batches: The stream is divided into bs (batch size) roughly equal-length columns. This creates a matrix of shape (stream_length // bs, bs) where each column is a contiguous subsequence.
  3. Sequence windowing: A sliding window of length seq_len moves down the rows to produce input sequences x and target sequences y, where y is x shifted forward by one position.

This batching strategy is fundamentally different from classification data loading. In classification, each sample is independent. In language modeling, the hidden state of the RNN carries over from one batch to the next within the same column, allowing the model to learn dependencies that span multiple windows.
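The three steps above can be sketched in plain Python (a minimal illustration, not fastai's implementation; the integer 1 stands in for the xxbos token, an arbitrary choice for this toy corpus):

```python
# Toy numericalized corpus: each document is a list of token indices,
# with 1 standing in for the xxbos beginning-of-document token.
docs = [[1, 10, 11, 12], [1, 20, 21, 22, 23], [1, 30, 31]]
bs, seq_len = 2, 3

# Step 1: concatenate all documents into one contiguous stream.
stream = [tok for doc in docs for tok in doc]

# Step 2: trim so the stream divides evenly into bs columns.
num_rows = len(stream) // bs
stream = stream[: num_rows * bs]

# Step 3: build a (num_rows, bs) matrix in which column j is the
# contiguous slice stream[j*num_rows : (j+1)*num_rows].
matrix = [[stream[j * num_rows + i] for j in range(bs)] for i in range(num_rows)]

# Sliding window of seq_len rows; y is x shifted down by one row.
batches = []
for i in range(0, num_rows - seq_len, seq_len):
    x = matrix[i : i + seq_len]
    y = matrix[i + 1 : i + seq_len + 1]
    batches.append((x, y))

print(len(stream), num_rows, len(batches))  # 12 6 1
```

Note that every row of y equals the next row of x's column stream: the target for each position is simply the token one step ahead.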

Usage

Use language model data loading when:

  • Fine-tuning a pretrained language model (e.g., AWD-LSTM pretrained on Wikitext-103) on a domain-specific corpus.
  • Training a language model from scratch on a new corpus.
  • Running the first stage of the ULMFiT three-step transfer learning pipeline.

This is distinct from classifier data loading where each sample has a label and samples are independent.

Theoretical Basis

Stream Concatenation and Batching

FUNCTION prepare_lm_data(documents, bs, seq_len):
    # Step 1: Concatenate all documents into one stream
    stream = []
    FOR EACH doc IN documents:
        stream.extend(doc)  # doc is already numericalized

    # Step 2: Trim to fit evenly into batches
    n = len(stream)
    trimmed_length = (n // bs) * bs
    stream = stream[:trimmed_length]

    # Step 3: Reshape so each COLUMN is a contiguous subsequence.
    # A row-major reshape to (bs, num_rows) makes each ROW contiguous;
    # transposing then puts those contiguous runs into columns.
    num_rows = trimmed_length // bs
    matrix = transpose(reshape(stream, shape=(bs, num_rows)))
    # matrix shape: (num_rows, bs)

    # Step 4: Generate (x, y) pairs via sliding window
    # (stop early enough that the shifted target window still fits)
    FOR i IN range(0, num_rows - seq_len, seq_len):
        x = matrix[i : i + seq_len, :]      # shape: (seq_len, bs)
        y = matrix[i + 1 : i + seq_len + 1, :]  # shape: (seq_len, bs), shifted by 1
        YIELD (x, y)

Why This Batching Strategy?

Consider a stream of 20 tokens with bs=4:

Stream: [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t]

Reshaped into 4 columns (bs=4), 5 rows:
     Col0  Col1  Col2  Col3
Row0:  a     f     k     p
Row1:  b     g     l     q
Row2:  c     h     m     r
Row3:  d     i     n     s
Row4:  e     j     o     t

With seq_len=2:
Batch 1: x = [[a,f,k,p], [b,g,l,q]]  y = [[b,g,l,q], [c,h,m,r]]
Batch 2: x = [[c,h,m,r], [d,i,n,s]]  y = [[d,i,n,s], [e,j,o,t]]

Key insight: Within each column, the tokens are contiguous from the original stream. The LSTM hidden state from the last token of batch 1, column 0 (token b) carries over to the first token of batch 2, column 0 (token c). This allows the model to learn long-range dependencies that extend beyond a single seq_len window.
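The worked example above can be reproduced and checked directly (a self-contained sketch; single letters stand in for token indices):

```python
import string

# The 20-token stream [a .. t] from the example, with bs=4 and seq_len=2.
stream = list(string.ascii_lowercase[:20])
bs, seq_len = 4, 2
num_rows = len(stream) // bs  # 5 rows

# Column j holds the contiguous slice stream[j*num_rows : (j+1)*num_rows];
# row i collects the i-th token of each column.
matrix = [[stream[j * num_rows + i] for j in range(bs)] for i in range(num_rows)]

# Slide a window of seq_len rows; targets are the same window shifted by one.
batches = [
    (matrix[i : i + seq_len], matrix[i + 1 : i + seq_len + 1])
    for i in range(0, num_rows - seq_len, seq_len)
]

print(batches[0][0])  # [['a', 'f', 'k', 'p'], ['b', 'g', 'l', 'q']]
```

The last x-row of batch 1 ends column 0 at token b, and the first x-row of batch 2 starts column 0 at token c, which is exactly the contiguity that lets the hidden state carry over between batches.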

Sequence Length Randomization

To prevent the model from learning to expect a fixed sequence length and to provide regularization, the actual sequence length for each batch is sampled from a distribution centered around seq_len:

FUNCTION sample_seq_len(target_seq_len):
    # With probability 0.95, keep the target length;
    # with probability 0.05, use half the target length
    IF random() < 0.05:
        effective_len = target_seq_len // 2
    ELSE:
        effective_len = target_seq_len

    # Add small random variation
    actual_len = max(1, effective_len + random_int(-5, 5))
    RETURN actual_len

This technique, borrowed from the AWD-LSTM paper (Merity et al., 2017), acts as a form of data augmentation and helps prevent overfitting to fixed-length patterns.
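The sampler above translates directly into Python (a sketch of the heuristic as described here, not fastai's exact implementation; the uniform ±5 jitter follows the pseudocode, whereas the AWD-LSTM paper draws the variation from a normal distribution):

```python
import random

def sample_seq_len(target_seq_len):
    """Sample a per-batch sequence length centered on target_seq_len."""
    # With probability 0.05, halve the target so the model also
    # sees short contexts; otherwise keep the full target length.
    if random.random() < 0.05:
        effective_len = target_seq_len // 2
    else:
        effective_len = target_seq_len
    # Small integer jitter, floored at 1 so a batch is never empty.
    return max(1, effective_len + random.randint(-5, 5))

random.seed(0)  # reproducible demonstration
lengths = [sample_seq_len(70) for _ in range(1000)]
print(min(lengths), max(lengths))
```

With a target of 70, full-length batches land in [65, 75] and halved batches in [30, 40], so the model sees a mix of context lengths during training.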

Related Pages

Implemented By
