Principle:Fastai Fastbook Language Model Data
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Language Modeling, Data Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Language model data preparation is the process of organizing numericalized token sequences into contiguous streams and then slicing them into overlapping input-target pairs suitable for training an autoregressive language model.
Description
A language model is trained to predict the next token given all preceding tokens. To create training data for this task, the standard approach involves:
- Concatenation: All documents in the corpus are concatenated into a single long stream of token indices, separated by xxbos tokens at document boundaries.
- Reshaping into batches: The stream is divided into bs (batch size) equal-length columns (any leftover tokens are trimmed), creating a matrix of shape (stream_length // bs, bs) where each column is a contiguous subsequence of the stream.
- Sequence windowing: A sliding window of length seq_len moves down the rows to produce input sequences x and target sequences y, where y is x shifted forward by one position.
This batching strategy is fundamentally different from classification data loading. In classification, each sample is independent. In language modeling, the hidden state of the RNN carries over from one batch to the next within the same column, allowing the model to learn dependencies that span multiple windows.
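The shift-by-one relationship between inputs and targets can be illustrated on a toy stream, with integer indices standing in for tokens (a minimal sketch, not fastai code):

```python
import numpy as np

# Hypothetical stream of 10 numericalized tokens
stream = np.arange(10)

seq_len = 4
x = stream[0:seq_len]        # input window
y = stream[1:seq_len + 1]    # targets: the same window shifted one step ahead

print(x.tolist())  # [0, 1, 2, 3]
print(y.tolist())  # [1, 2, 3, 4]
```

At every position the model sees token t and is asked to predict token t+1.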
Usage
Use language model data loading when:
- Fine-tuning a pretrained language model (e.g., AWD-LSTM pretrained on WikiText-103) on a domain-specific corpus.
- Training a language model from scratch on a new corpus.
- Performing the first stage of the ULMFiT three-step transfer learning pipeline.
This is distinct from classifier data loading where each sample has a label and samples are independent.
Theoretical Basis
Stream Concatenation and Batching
FUNCTION prepare_lm_data(documents, bs, seq_len):
# Step 1: Concatenate all documents into one stream
stream = []
FOR EACH doc IN documents:
stream.extend(doc) # doc is already numericalized
# Step 2: Trim to fit evenly into batches
n = len(stream)
trimmed_length = (n // bs) * bs
stream = stream[:trimmed_length]
# Step 3: Reshape so each COLUMN is a contiguous subsequence:
# fill bs rows of length trimmed_length // bs, then transpose
matrix = transpose(reshape(stream, shape=(bs, -1)))
num_rows = trimmed_length // bs # matrix shape: (num_rows, bs)
# Step 4: Generate (x, y) pairs via sliding window
FOR i IN range(0, num_rows - 1, seq_len):
x = matrix[i : i + seq_len, :] # shape: (seq_len, bs)
y = matrix[i + 1 : i + seq_len + 1, :] # shifted by 1
YIELD (x, y)
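The pseudocode above can be sketched as runnable NumPy (an illustration of the algorithm, not fastai's actual LMDataLoader; the key detail is reshaping to (bs, -1) and transposing so each column stays contiguous):

```python
import numpy as np

def prepare_lm_data(documents, bs, seq_len):
    # Step 1: concatenate all numericalized documents into one stream
    stream = np.concatenate([np.asarray(doc) for doc in documents])
    # Step 2: trim so the stream divides evenly into bs columns
    n = (len(stream) // bs) * bs
    stream = stream[:n]
    # Step 3: reshape to (bs, n // bs), then transpose, so each COLUMN
    # of the resulting (n // bs, bs) matrix is contiguous in the stream
    matrix = stream.reshape(bs, -1).T
    num_rows = matrix.shape[0]
    # Step 4: slide a window of seq_len rows to produce (x, y) pairs
    for i in range(0, num_rows - 1, seq_len):
        x = matrix[i : i + seq_len]
        y = matrix[i + 1 : i + seq_len + 1]
        yield x, y

# Two toy "documents" whose concatenation is the stream 0..19
docs = [list(range(12)), list(range(12, 20))]
batches = list(prepare_lm_data(docs, bs=4, seq_len=2))
print(batches[0][0].tolist())  # [[0, 5, 10, 15], [1, 6, 11, 16]]
print(batches[0][1].tolist())  # [[1, 6, 11, 16], [2, 7, 12, 17]]
```

With 0..19 standing in for the tokens a..t, the printed batches reproduce the worked example in the next section.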
Why This Batching Strategy?
Consider a stream of 20 tokens with bs=4:
Stream: [a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t]
Reshaped into 4 columns (bs=4), 5 rows:
Col0 Col1 Col2 Col3
Row0: a f k p
Row1: b g l q
Row2: c h m r
Row3: d i n s
Row4: e j o t
With seq_len=2:
Batch 1: x = [[a,f,k,p], [b,g,l,q]] y = [[b,g,l,q], [c,h,m,r]]
Batch 2: x = [[c,h,m,r], [d,i,n,s]] y = [[d,i,n,s], [e,j,o,t]]
Key insight: Within each column, the tokens are contiguous from the original stream. The LSTM hidden state from the last token of batch 1, column 0 (token b) carries over to the first token of batch 2, column 0 (token c). This allows the model to learn long-range dependencies that extend beyond a single seq_len window.
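This carry-over structure can be checked numerically, with the integers 0-19 standing in for the tokens a-t (a sketch, not fastai code):

```python
import numpy as np

stream = np.arange(20)              # 0..19 stand for a..t
bs, seq_len = 4, 2
matrix = stream.reshape(bs, -1).T   # shape (5, 4); columns are contiguous

# Column 0 holds the unbroken subsequence a..e
print(matrix[:, 0].tolist())        # [0, 1, 2, 3, 4]

# The last row of batch 1 in column 0 ("b" = 1) is immediately
# followed in the stream by the first row of batch 2 ("c" = 2)
print(int(matrix[seq_len - 1, 0]))  # 1
print(int(matrix[seq_len, 0]))      # 2
```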
Sequence Length Randomization
To prevent the model from learning to expect a fixed sequence length and to provide regularization, the actual sequence length for each batch is sampled from a distribution centered around seq_len:
FUNCTION sample_seq_len(target_seq_len):
# With probability 0.05, halve the target length;
# otherwise keep it as the base
IF random() < 0.05:
effective_len = target_seq_len // 2
ELSE:
effective_len = target_seq_len
# In either case, add a small random variation
actual_len = max(1, effective_len + random_int(-5, 5))
RETURN actual_len
This technique, borrowed from the AWD-LSTM paper (Merity et al., 2017), acts as a form of data augmentation and helps prevent overfitting to fixed-length patterns.
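The sampling sketch above can be written as runnable Python; the parameter names p_half and jitter are illustrative defaults, not fastai's API:

```python
import random

def sample_seq_len(target_seq_len, p_half=0.05, jitter=5):
    # Occasionally halve the base length...
    base = target_seq_len // 2 if random.random() < p_half else target_seq_len
    # ...then add a small random variation, keeping the length at least 1
    return max(1, base + random.randint(-jitter, jitter))

# Every sampled length stays within [target // 2 - jitter, target + jitter]
print(all(1 <= sample_seq_len(70) <= 75 for _ in range(1000)))  # True
```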