
Implementation:Fastai Fastbook LMDataLoader

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Language Modeling
Last Updated 2026-02-09 17:00 GMT

Overview

A concrete tool, provided by the fastai library, for creating batched input-target pairs from numericalized text streams for language model training.

Description

The fastai library provides two approaches for creating language model data loaders:

  • LMDataLoader: A low-level data loader that takes a pre-numericalized dataset and produces (x, y) batches where y is x shifted by one token position. It handles stream concatenation, batch reshaping, and sequence length randomization internally.
  • DataBlock with TextBlock(is_lm=True): A high-level API that combines tokenization, numericalization, and data loading into a single declarative pipeline. TextBlock.from_folder reads text files from a directory structure and handles all preprocessing automatically.
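The batching scheme LMDataLoader implements (stream concatenation, reshaping into parallel streams, and targets shifted by one token) can be illustrated with a plain-Python sketch. The helper name `lm_batches` is illustrative, not fastai API:

```python
# Illustrative sketch of LMDataLoader's batching scheme (not fastai code):
# concatenate all documents into one stream, split it into `bs` parallel
# streams, then emit (x, y) windows where y is x shifted by one token.

def lm_batches(docs, bs, seq_len):
    stream = [tok for doc in docs for tok in doc]    # 1. concatenate
    stream_len = len(stream) // bs                   # tokens per parallel stream
    streams = [stream[i*stream_len:(i+1)*stream_len] for i in range(bs)]
    batches = []
    # one extra token is needed at the end of each window for the target
    for start in range(0, stream_len - 1, seq_len):
        end = min(start + seq_len, stream_len - 1)
        x = [s[start:end] for s in streams]
        y = [s[start+1:end+1] for s in streams]      # shifted by one position
        batches.append((x, y))
    return batches

docs = [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10]]
(x, y), = lm_batches(docs, bs=2, seq_len=4)
print(x)  # [[1, 2, 3, 4], [6, 7, 8, 9]]
print(y)  # [[2, 3, 4, 5], [7, 8, 9, 10]]
```

Note how each element of y is the token that immediately follows the corresponding element of x within the same stream; this is exactly the next-token-prediction target a language model trains on.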

The DataBlock approach is preferred for most workflows because it integrates all preprocessing steps, handles train/validation splitting, and stores the vocabulary for later reuse by the classifier.

When is_lm=True is specified on TextBlock, the data block:

  1. Reads all text files from the specified folders.
  2. Tokenizes them using the fastai Tokenizer with spaCy.
  3. Numericalizes them using Numericalize.
  4. Wraps the result in LMDataLoader instances for training and validation.
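Steps 2 and 3 above can be sketched with a toy pipeline. This is a deliberate simplification: fastai tokenizes with spaCy and adds a richer set of special tokens, whereas this sketch uses whitespace splitting and a plain list vocabulary:

```python
# Toy sketch of the tokenize -> numericalize steps (fastai uses spaCy and
# more special tokens; this simplification uses whitespace splitting).

def tokenize(text):
    return ["xxbos"] + text.lower().split()   # mark beginning-of-stream

def build_vocab(token_lists):
    vocab = ["xxunk", "xxbos"]                # special tokens first
    seen = set(vocab)
    for toks in token_lists:
        for t in toks:
            if t not in seen:
                seen.add(t)
                vocab.append(t)
    return vocab

def numericalize(tokens, vocab):
    idx = {t: i for i, t in enumerate(vocab)}
    return [idx.get(t, 0) for t in tokens]    # unknown tokens -> xxunk (0)

texts = ["the movie was great", "the plot was thin"]
token_lists = [tokenize(t) for t in texts]
vocab = build_vocab(token_lists)
nums = [numericalize(toks, vocab) for toks in token_lists]
print(vocab[:4])   # ['xxunk', 'xxbos', 'the', 'movie']
print(nums[0])     # [1, 2, 3, 4, 5]
```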

Usage

Use these tools when setting up the first stage of the ULMFiT pipeline: fine-tuning a pretrained language model on domain-specific text. The resulting DataLoaders object is passed directly to language_model_learner.

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/10_nlp.md (lines 455-520)
  • Library module: fastai.text.data

Signature

class LMDataLoader(DataLoader):
    "DataLoader that creates language model batches from a numericalized dataset"
    def __init__(
        self,
        dataset,               # Numericalized token sequences
        bs: int = 64,          # Batch size (number of parallel streams)
        seq_len: int = 72,     # Target sequence length per batch
        num_workers: int = 0,  # Number of worker processes
        shuffle: bool = False, # Whether to shuffle (usually False for LM)
        **kwargs
    ):
        ...

# High-level DataBlock API
dls = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, bs=128, seq_len=80)

Import

from fastai.text.all import (
    LMDataLoader, TextBlock, DataBlock,
    get_text_files, RandomSplitter
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| dataset | list of TensorText | Yes (for LMDataLoader) | Pre-numericalized token sequences as integer tensors. |
| path | Path | Yes (for DataBlock) | Root path containing the text file directory structure. |
| bs | int | No | Batch size: number of parallel token streams processed simultaneously. Default: 64 (LMDataLoader) or 128 (typical DataBlock usage). |
| seq_len | int | No | Target sequence length for each batch window. Default: 72 (LMDataLoader) or 80 (typical DataBlock usage). |
| is_lm | bool | Yes (for TextBlock) | Must be True to create language model data. When True, targets are the input shifted by one position rather than external labels. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| dls | DataLoaders | A DataLoaders object containing train and validation LMDataLoader instances. Access via dls.train and dls.valid. |
| batch (x) | TensorText | Input tensor of shape (bs, seq_len) containing token indices. |
| batch (y) | TensorText | Target tensor of shape (bs, seq_len); each position holds the token one step ahead of the corresponding position in x within the same stream. |
| dls.vocab | list of str | The vocabulary built during numericalization. Must be saved and reused for classifier data preparation. |

Usage Examples

Basic Usage with DataBlock API

from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Create language model DataLoaders using the high-level API
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Inspect the DataLoaders
dls_lm.show_batch(max_n=2)

Inspecting Batch Structure

from fastai.text.all import *

path = untar_data(URLs.IMDB)

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Get a single batch and inspect shapes
x, y = dls_lm.one_batch()
print(f"Input shape:  {x.shape}")   # torch.Size([128, 80]) -> (bs, seq_len)
print(f"Target shape: {y.shape}")   # torch.Size([128, 80])

# Verify y is x shifted by 1:
# the target at position i is the input token at position i+1
print(f"x[0, 1] = {x[0, 1]}")  # Second token of the first stream
print(f"y[0, 0] = {y[0, 0]}")  # Equals x[0, 1]

Saving Vocabulary for Classifier Reuse

from fastai.text.all import *

path = untar_data(URLs.IMDB)

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# The vocabulary is accessible from the DataLoaders
# This MUST be reused when creating classifier DataLoaders
vocab = dls_lm.vocab
print(f"Vocabulary size: {len(vocab)}")
# Output: ~60,000 (capped by max_vocab)

print(f"First 10 tokens: {vocab[:10]}")
# Output: ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', 'the']
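Why the vocabulary must be reused: token indices are only meaningful relative to the vocabulary that produced them. A plain-Python illustration of the failure mode (the `numericalize` helper and the two vocabularies are illustrative, not fastai API):

```python
# If the classifier rebuilds its own vocabulary, the same token can map to a
# different index, silently scrambling the pretrained embedding lookup.

lm_vocab = ["xxunk", "xxpad", "the", "movie", "great"]
clf_vocab = ["xxunk", "xxpad", "movie", "great", "the"]   # rebuilt: different order

def numericalize(tokens, vocab):
    idx = {t: i for i, t in enumerate(vocab)}
    return [idx.get(t, 0) for t in tokens]                # unknown -> xxunk (0)

tokens = ["the", "movie"]
print(numericalize(tokens, lm_vocab))   # [2, 3]
print(numericalize(tokens, clf_vocab))  # [4, 2] -- same words, different ids
```

In fastai this is avoided by passing the saved vocabulary when building the classifier data, e.g. TextBlock.from_folder(path, vocab=dls_lm.vocab), which keeps indices aligned with the fine-tuned embeddings.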

Low-level LMDataLoader Usage

from fastai.text.all import LMDataLoader
import torch

# Suppose we have pre-numericalized data as a list of tensors
nums = [torch.randint(0, 1000, (500,)) for _ in range(100)]

# Create an LMDataLoader directly
dl = LMDataLoader(nums, bs=64, seq_len=72)

# Iterate over batches
for x, y in dl:
    print(f"Batch x shape: {x.shape}")  # (bs, seq_len) = (64, 72)
    print(f"Batch y shape: {y.shape}")  # (bs, seq_len) = (64, 72)
    break
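For the synthetic dataset above, the number of batches per epoch can be estimated from the batching scheme. This is a rough back-of-the-envelope calculation, not a guarantee of fastai's exact count (training-mode sequence-length randomization can shift it slightly):

```python
# Rough estimate of batches per epoch for 100 documents of 500 tokens each,
# batched as 64 parallel streams with windows of 72 tokens.

n_docs, doc_len = 100, 500
bs, seq_len = 64, 72

corpus = n_docs * doc_len          # 50,000 tokens total
per_stream = corpus // bs          # 781 tokens in each of the 64 streams
# each window needs one extra token at the end for the shifted target
n_batches = (per_stream - 1) // seq_len
print(n_batches)  # 10
```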

Related Pages

Implements Principle

Requires Environment
