Implementation: Fastai Fastbook LMDataLoader
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Language Modeling |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for creating batched input-target pairs from numericalized text streams for language model training, provided by the fastai library.
Description
The fastai library provides two approaches for creating language model data loaders:
- LMDataLoader: A low-level data loader that takes a pre-numericalized dataset and produces (x, y) batches where y is x shifted by one token position. It handles stream concatenation, batch reshaping, and sequence length randomization internally.
- DataBlock with TextBlock(is_lm=True): A high-level API that combines tokenization, numericalization, and data loading into a single declarative pipeline. TextBlock.from_folder reads text files from a directory structure and handles all preprocessing automatically.
The DataBlock approach is preferred for most workflows because it integrates all preprocessing steps, handles train/validation splitting, and stores the vocabulary for later reuse by the classifier.
When is_lm=True is specified on TextBlock, the data block:
- Reads all text files from the specified folders.
- Tokenizes them using the fastai Tokenizer with spaCy.
- Numericalizes them using Numericalize.
- Wraps the result in LMDataLoader instances for training and validation.
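To make the shifted-target relationship concrete, the sketch below shows in plain PyTorch what an LM data loader conceptually does: concatenate the numericalized documents into one stream, split it into bs parallel sub-streams, and emit (x, y) windows where y is x shifted by one token. This is an illustration only, not the fastai implementation (which additionally handles length caching, optional shuffling of document order, and a shorter final batch).
import torch

def lm_batches(docs, bs=4, seq_len=5):
    "Illustrative sketch only: yield (x, y) batches where y is x shifted by one token."
    stream = torch.cat(docs)                 # concatenate all numericalized documents
    n = (len(stream) - 1) // bs              # tokens per parallel sub-stream
    data = stream[:n * bs].view(bs, n)       # bs parallel streams, batch-first
    targ = stream[1:n * bs + 1].view(bs, n)  # the same streams shifted by one token
    for i in range(0, n, seq_len):
        yield data[:, i:i + seq_len], targ[:, i:i + seq_len]

docs = [torch.arange(20), torch.arange(100, 120)]  # two fake "documents" of token ids
for x, y in lm_batches(docs):
    print(x.shape, y.shape)  # torch.Size([4, 5]) torch.Size([4, 5])
    break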
Usage
Use these tools when setting up the first stage of the ULMFiT pipeline: fine-tuning a pretrained language model on domain-specific text. The resulting DataLoaders object is passed directly to language_model_learner.
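For context, the next step of that pipeline looks like the following; the learner settings (drop_mult, learning rate, single epoch) follow the fastbook IMDB example and are illustrative rather than required.
from fastai.text.all import *

path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Fine-tune a pretrained AWD-LSTM language model on the IMDB text
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
learn.fit_one_cycle(1, 2e-2)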
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 455-520)
- Library module: fastai.text.data
Signature
class LMDataLoader(TfmdDL):
    "DataLoader that creates language model batches from a numericalized dataset"
    def __init__(
        self,
        dataset,                # Pre-numericalized token sequences
        lens=None,              # Optional precomputed length of each sequence
        cache=2,                # Size of the cache used when reading items
        bs: int = 64,           # Batch size (number of parallel streams)
        seq_len: int = 72,      # Target sequence length per batch
        num_workers: int = 0,   # Number of worker processes
        **kwargs                # Forwarded to TfmdDL (e.g. shuffle, device)
    ):
        ...
# High-level DataBlock API
dls = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, bs=128, seq_len=80)
Import
from fastai.text.all import (
    LMDataLoader, TextBlock, DataBlock,
    get_text_files, RandomSplitter
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | list of TensorText | Yes (for LMDataLoader) | Pre-numericalized token sequences as integer tensors. |
| path | Path | Yes (for DataBlock) | Root path containing the text file directory structure. |
| bs | int | No | Batch size: number of parallel token streams processed simultaneously. Default: 64 (LMDataLoader) or 128 (typical DataBlock usage). |
| seq_len | int | No | Target sequence length for each batch window. Default: 72 (LMDataLoader) or 80 (typical DataBlock usage). |
| is_lm | bool | Yes (for TextBlock) | Must be True to create language model data. When True, targets are input shifted by 1 position rather than external labels. |
Outputs
| Name | Type | Description |
|---|---|---|
| dls | DataLoaders | A DataLoaders object containing train and validation LMDataLoader instances. Access via dls.train and dls.valid. |
| batch (x) | TensorText | Input tensor of shape (bs, seq_len) containing token indices. |
| batch (y) | TensorText | Target tensor of shape (bs, seq_len) containing token indices shifted by one position relative to x. |
| dls.vocab | list of str | The vocabulary built during numericalization. Must be saved and reused for classifier data preparation. |
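As a quick orientation, the individual loaders and the vocabulary can be pulled from the DataLoaders object like this (assuming dls_lm was built as in the Usage Examples below):
train_dl = dls_lm.train    # LMDataLoader over the training split
valid_dl = dls_lm.valid    # LMDataLoader over the validation split
x, y = dls_lm.one_batch()  # a single (input, target) batch from the training loader
vocab = dls_lm.vocab       # vocabulary built during numericalization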
Usage Examples
Basic Usage with DataBlock API
from fastai.text.all import *
path = untar_data(URLs.IMDB)
# Create language model DataLoaders using the high-level API
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# Inspect the DataLoaders
dls_lm.show_batch(max_n=2)
Inspecting Batch Structure
from fastai.text.all import *
path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# Get a single batch and inspect shapes
x, y = dls_lm.one_batch()
print(f"Input shape: {x.shape}") # torch.Size([80, 128])
print(f"Target shape: {y.shape}") # torch.Size([80, 128])
# Verify y is x shifted by 1
# The target for position i is the token at position i+1
print(f"x[0, 0] = {x[0, 0]}") # First input token
print(f"y[0, 0] = {y[0, 0]}") # Should be second token in sequence
Saving Vocabulary for Classifier Reuse
from fastai.text.all import *
path = untar_data(URLs.IMDB)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# The vocabulary is accessible from the DataLoaders
# This MUST be reused when creating classifier DataLoaders
vocab = dls_lm.vocab
print(f"Vocabulary size: {len(vocab)}")
# Output: ~60,000 (capped by max_vocab)
print(f"First 10 tokens: {vocab[:10]}")
# Output: ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', 'the']
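For reference, the classifier stage in the fastbook workflow reuses this vocabulary by passing it to TextBlock.from_folder via the vocab argument; the folder names and splitter below match the IMDB directory layout used in the book.
# Reuse the language model vocabulary so token indices match the fine-tuned embeddings
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)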
Low-level LMDataLoader Usage
from fastai.text.all import LMDataLoader
import torch
# Suppose we have pre-numericalized data as a list of tensors
nums = [torch.randint(0, 1000, (500,)) for _ in range(100)]
# Create an LMDataLoader directly
dl = LMDataLoader(nums, bs=64, seq_len=72)
# Iterate over batches
for x, y in dl:
    print(f"Batch x shape: {x.shape}")  # (64, 72) -> (bs, seq_len)
    print(f"Batch y shape: {y.shape}")  # (64, 72)
    break
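With these toy inputs (100 sequences of 500 tokens, about 50,000 tokens in total), the loader concatenates everything into one stream split across 64 parallel sub-streams, so it yields roughly 50,000 / (64 × 72) ≈ 11 batches per epoch; the final batch usually has a shorter sequence length because the stream rarely divides evenly into seq_len windows.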