Implementation: Fastai Fastbook LMDataLoader
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Language Modeling |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for creating batched input-target pairs from numericalized text streams for language model training, provided by the fastai library.
Description
The fastai library provides two approaches for creating language model data loaders:
- LMDataLoader: A low-level data loader that takes a pre-numericalized dataset and produces (x, y) batches where y is x shifted by one token position. It handles stream concatenation, batch reshaping, and sequence length randomization internally.
- DataBlock with TextBlock(is_lm=True): A high-level API that combines tokenization, numericalization, and data loading into a single declarative pipeline. TextBlock.from_folder reads text files from a directory structure and handles all preprocessing automatically.
The DataBlock approach is preferred for most workflows because it integrates all preprocessing steps, handles train/validation splitting, and stores the vocabulary for later reuse by the classifier.
When is_lm=True is specified on TextBlock, the data block:
- Reads all text files from the specified folders.
- Tokenizes them using the fastai Tokenizer with spaCy.
- Numericalizes them using Numericalize.
- Wraps the result in LMDataLoader instances for training and validation.
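To make the shifted-target relationship concrete, the sketch below shows in plain PyTorch what an LM data loader conceptually does: concatenate the numericalized documents into one stream, split it into bs parallel sub-streams, and emit (x, y) windows where y is x shifted by one token. This is an illustration only, not the fastai implementation (which additionally handles length caching, optional shuffling of document order, and a shorter final batch).
import torch

def lm_batches(docs, bs=4, seq_len=5):
    "Illustrative sketch only: yield (x, y) batches where y is x shifted by one token."
    stream = torch.cat(docs)                 # concatenate all numericalized documents
    n = (len(stream) - 1) // bs              # tokens per parallel sub-stream
    data = stream[:n * bs].view(bs, n)       # bs parallel streams, batch-first
    targ = stream[1:n * bs + 1].view(bs, n)  # the same streams shifted by one token
    for i in range(0, n, seq_len):
        yield data[:, i:i + seq_len], targ[:, i:i + seq_len]

docs = [torch.arange(20), torch.arange(100, 120)]  # two fake "documents" of token ids
for x, y in lm_batches(docs):
    print(x.shape, y.shape)  # torch.Size([4, 5]) torch.Size([4, 5])
    break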
Usage
Use these tools when setting up the first stage of the ULMFiT pipeline: fine-tuning a pretrained language model on domain-specific text. The resulting DataLoaders object is passed directly to language_model_learner.
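For context, the next step of that pipeline looks like the following; the learner settings (drop_mult, learning rate, single epoch) follow the fastbook IMDB example and are illustrative rather than required.
from fastai.text.all import *

path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Fine-tune a pretrained AWD-LSTM language model on the IMDB text
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
learn.fit_one_cycle(1, 2e-2)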
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 455-520)
- Library module: fastai.text.data
Signature
class LMDataLoader(TfmdDL):
    "DataLoader that creates language model batches from a numericalized dataset"
    def __init__(
        self,
        dataset,                # Pre-numericalized token sequences
        lens=None,              # Optional precomputed length of each sequence
        cache=2,                # Size of the cache used when reading items
        bs: int = 64,           # Batch size (number of parallel streams)
        seq_len: int = 72,      # Target sequence length per batch
        num_workers: int = 0,   # Number of worker processes
        **kwargs                # Forwarded to TfmdDL (e.g. shuffle, device)
    ):
        ...
# High-level DataBlock API
dls = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, bs=128, seq_len=80)
Import
from fastai.text.all import (
    LMDataLoader, TextBlock, DataBlock,
    get_text_files, RandomSplitter
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | list of TensorText | Yes (for LMDataLoader) | Pre-numericalized token sequences as integer tensors. |
| path | Path | Yes (for DataBlock) | Root path containing the text file directory structure. |
| bs | int | No | Batch size: number of parallel token streams processed simultaneously. Default: 64 (LMDataLoader) or 128 (typical DataBlock usage). |
| seq_len | int | No | Target sequence length for each batch window. Default: 72 (LMDataLoader) or 80 (typical DataBlock usage). |
| is_lm | bool | Yes (for TextBlock) | Must be True to create language model data. When True, targets are input shifted by 1 position rather than external labels. |
Outputs
| Name | Type | Description |
|---|---|---|
| dls | DataLoaders | A DataLoaders object containing train and validation LMDataLoader instances. Access via dls.train and dls.valid. |
| batch (x) | TensorText | Input tensor of shape (bs, seq_len) containing token indices. |
| batch (y) | TensorText | Target tensor of shape (bs, seq_len) containing token indices shifted by one position relative to x. |
| dls.vocab | list of str | The vocabulary built during numericalization. Must be saved and reused for classifier data preparation. |
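As a quick orientation, the individual loaders and the vocabulary can be pulled from the DataLoaders object like this (assuming dls_lm was built as in the Usage Examples below):
train_dl = dls_lm.train    # LMDataLoader over the training split
valid_dl = dls_lm.valid    # LMDataLoader over the validation split
x, y = dls_lm.one_batch()  # a single (input, target) batch from the training loader
vocab = dls_lm.vocab       # vocabulary built during numericalization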
Usage Examples
Basic Usage with DataBlock API
from fastai.text.all import *
path = untar_data(URLs.IMDB)
# Create language model DataLoaders using the high-level API
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# Inspect the DataLoaders
dls_lm.show_batch(max_n=2)
Inspecting Batch Structure
from fastai.text.all import *
path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# Get a single batch and inspect shapes
x, y = dls_lm.one_batch()
print(f"Input shape: {x.shape}") # torch.Size([80, 128])
print(f"Target shape: {y.shape}") # torch.Size([80, 128])
# Verify y is x shifted by 1
# The target for position i is the token at position i+1
print(f"x[0, 0] = {x[0, 0]}") # First input token
print(f"y[0, 0] = {y[0, 0]}") # Should be second token in sequence
Saving Vocabulary for Classifier Reuse
from fastai.text.all import *
path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# The vocabulary is accessible from the DataLoaders
# This MUST be reused when creating classifier DataLoaders
vocab = dls_lm.vocab
print(f"Vocabulary size: {len(vocab)}")
# Output: ~60,000 (capped by max_vocab)
print(f"First 10 tokens: {vocab[:10]}")
# Output: ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', 'the']
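For reference, the classifier stage in the fastbook workflow reuses this vocabulary by passing it to TextBlock.from_folder via the vocab argument; the folder names and splitter below match the IMDB directory layout used in the book.
# Reuse the language model vocabulary so token indices match the fine-tuned embeddings
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)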
Low-level LMDataLoader Usage
from fastai.text.all import LMDataLoader
import torch
# Suppose we have pre-numericalized data as a list of tensors
nums = [torch.randint(0, 1000, (500,)) for _ in range(100)]
# Create an LMDataLoader directly
dl = LMDataLoader(nums, bs=64, seq_len=72)
# Iterate over batches
for x, y in dl:
    print(f"Batch x shape: {x.shape}")  # (64, 72) -> (bs, seq_len)
    print(f"Batch y shape: {y.shape}")  # (64, 72)
    break
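With these toy inputs (100 sequences of 500 tokens, about 50,000 tokens in total), the loader concatenates everything into one stream split across 64 parallel sub-streams, so it yields roughly 50,000 / (64 × 72) ≈ 11 batches per epoch; the final batch usually has a shorter sequence length because the stream rarely divides evenly into seq_len windows.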