Implementation: Fastai Fastbook Text Classifier DataLoaders
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Text Classification |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for creating classification-ready DataLoaders that pair tokenized text documents with category labels while preserving vocabulary alignment with a pretrained language model, provided by the fastai library.
Description
The classifier DataLoaders are built using the fastai DataBlock API with specific configuration for text classification:
- TextBlock.from_folder(path, vocab=dls_lm.vocab): Creates a text processing block that tokenizes and numericalizes text files, but critically uses the vocabulary from the language model DataLoaders rather than building a new one. The vocab parameter is the key mechanism for ensuring vocabulary consistency.
- CategoryBlock: Creates a categorical target block that maps label strings ("pos", "neg") to integer indices.
- get_y=parent_label: A function that extracts the label from the parent directory name of each text file.
- splitter=GrandparentSplitter(valid_name='test'): Splits data into train and validation sets based on the grandparent directory name, using "test" as the validation set identifier.
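The labeling and splitting callables above amount to simple path inspection. As a minimal sketch (plain-Python stand-ins, not the fastai implementations), `parent_label` returns the name of a file's parent directory, and `GrandparentSplitter` assigns items to train or validation by their grandparent directory:

```python
from pathlib import Path

def parent_label_sketch(path):
    # Label = name of the file's immediate parent directory.
    return Path(path).parent.name

def grandparent_splitter_sketch(items, train_name="train", valid_name="test"):
    # Split by the grandparent directory: .../train/pos/0.txt -> train set.
    train = [i for i, p in enumerate(items) if Path(p).parent.parent.name == train_name]
    valid = [i for i, p in enumerate(items) if Path(p).parent.parent.name == valid_name]
    return train, valid

files = ["imdb/train/pos/0.txt", "imdb/train/neg/1.txt", "imdb/test/pos/2.txt"]
print([parent_label_sketch(f) for f in files])  # ['pos', 'neg', 'pos']
print(grandparent_splitter_sketch(files))       # ([0, 1], [2])
```

The real fastai functions behave the same way on IMDb-style layouts, which is why no explicit label file is needed.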
The resulting DataLoaders sort documents by length so that each batch groups documents of similar size, then pad shorter documents up to the longest one in the batch, handling variable-length sequences efficiently.
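The sort-and-pad behavior can be illustrated with a short sketch (a hypothetical helper, not fastai's `SortedDL`): sort documents by token count so each batch groups similar lengths, then pad shorter documents to the batch maximum with the pad index (fastai left-pads classification batches by default):

```python
def sorted_padded_batches(docs, bs=2, pad_idx=1):
    # Sort by length so each batch groups similarly sized documents,
    # minimizing wasted padding (the idea behind fastai's SortedDL).
    docs = sorted(docs, key=len, reverse=True)
    batches = []
    for i in range(0, len(docs), bs):
        chunk = docs[i:i + bs]
        width = max(len(d) for d in chunk)
        # Left-pad each document to the longest one in this batch.
        batches.append([[pad_idx] * (width - len(d)) + d for d in chunk])
    return batches

docs = [[5, 6], [2, 3, 4, 7], [9], [8, 3, 1]]
for b in sorted_padded_batches(docs):
    print(b)
# [[2, 3, 4, 7], [1, 8, 3, 1]]
# [[5, 6], [1, 9]]
```

Because padding only extends to the longest document *in the batch*, grouping by length keeps the padded tensors much smaller than padding every document to the global maximum.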
Usage
Use this DataBlock configuration after completing language model fine-tuning. The dls_lm.vocab must be available from the language model training step. This creates the data pipeline for the final classification training stage.
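Why the shared vocabulary matters can be shown with a tiny sketch (hypothetical token lists, not fastai code): if the classifier built its own vocabulary, the same token could map to a different index than the one the pretrained encoder's embedding matrix expects:

```python
# Two vocabularies over the same tokens but built in a different order
# (e.g., from different corpora or frequency counts).
lm_vocab  = ["xxunk", "xxpad", "movie", "great", "boring"]
clf_vocab = ["xxunk", "xxpad", "boring", "movie", "great"]

lm_index  = {t: i for i, t in enumerate(lm_vocab)}
clf_index = {t: i for i, t in enumerate(clf_vocab)}

# "great" -> 3 under the LM vocab but 4 under an independent classifier vocab,
# so the encoder would look up embedding row 4 ("boring") for "great".
print(lm_index["great"], clf_index["great"])  # 3 4
```

Passing `vocab=dls_lm.vocab` forces the classifier to reuse the LM's token-to-index mapping, so every embedding row keeps its meaning.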
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 656-661)
- Library module: fastai.text.data
Signature
```python
dls_clas = DataBlock(
    blocks=(
        TextBlock.from_folder(path, vocab=dls_lm.vocab),
        CategoryBlock
    ),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
```
Import
```python
from fastai.text.all import (
    TextBlock, CategoryBlock, DataBlock,
    GrandparentSplitter, parent_label
)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | Path | Yes | Root path of the dataset directory (e.g., the IMDb dataset root containing train/ and test/ subdirectories). |
| vocab | list of str | Yes | Vocabulary from the language model DataLoaders. Passed as dls_lm.vocab to ensure token-to-index alignment with the pretrained encoder. |
| get_y | callable | Yes | Function to extract labels from file paths. parent_label returns the parent directory name as the label string. |
| splitter | callable | Yes | Splitting strategy. GrandparentSplitter(valid_name='test') uses the grandparent directory to assign files to train or validation. |
| bs | int | No | Batch size. Default: 128. May need to be reduced for GPU memory constraints since variable-length documents require more memory than fixed-length LM batches. |
| seq_len | int | No | Sequence length used when chunking documents; the classifier encoder processes long texts seq_len tokens at a time. Default: 72. |
Outputs
| Name | Type | Description |
|---|---|---|
| dls_clas | DataLoaders | A DataLoaders object with train and validation loaders. Training loader yields (text_tensor, label_tensor) pairs. |
| batch (x) | TensorText | Padded text tensor of shape (bs, max_seq_len_in_batch). Each row is a numericalized document, padded with the xxpad token index. |
| batch (y) | TensorCategory | Label tensor of shape (bs,) with integer category indices (0 for neg, 1 for pos, or vice versa). |
| dls_clas.vocab | list | A list holding the shared text vocabulary at index 0 (identical to the input vocab) and the category vocabulary at index 1. |
Usage Examples
Basic Usage
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1: Create LM DataLoaders (needed for the vocabulary)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Step 2: Create classifier DataLoaders with the shared vocabulary
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Inspect the data
dls_clas.show_batch(max_n=3)
```
Verifying Vocabulary Alignment
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Build LM and classifier DataLoaders
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Verify the vocabularies are identical
assert dls_clas.vocab[0] == dls_lm.vocab, "Vocabularies must match!"

# Check the category vocabulary
print(dls_clas.vocab[1])
# Output: ['neg', 'pos'] (or ['pos', 'neg'] depending on sort order)
```
Inspecting Batch Structure
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Assume dls_lm is already created (see Basic Usage above)
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Get a batch
x, y = dls_clas.one_batch()
print(f"Text batch shape: {x.shape}")
# Output: torch.Size([128, N]), where N is the length of the longest document in the batch
print(f"Label batch shape: {y.shape}")
# Output: torch.Size([128])
print(f"Unique labels: {y.unique()}")
# Output: tensor([0, 1])

# Decode a sample
x_dec, y_dec = dls_clas.decode((x, y))
print(f"Text: {x_dec[0][:100]}...")
print(f"Label: {y_dec[0]}")
```