Implementation:Fastai Fastbook Text Classifier DataLoaders

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Text Classification
Last Updated 2026-02-09 17:00 GMT

Overview

Concrete tool for creating classification-ready DataLoaders that pair tokenized text documents with category labels while preserving vocabulary alignment with a pretrained language model, provided by the fastai library.

Description

The classifier DataLoaders are built using the fastai DataBlock API with specific configuration for text classification:

  • TextBlock.from_folder(path, vocab=dls_lm.vocab): Creates a text processing block that tokenizes and numericalizes text files, but critically uses the vocabulary from the language model DataLoaders rather than building a new one. The vocab parameter is the key mechanism for ensuring vocabulary consistency.
  • CategoryBlock: Creates a categorical target block that maps label strings ("pos", "neg") to integer indices.
  • get_y=parent_label: A function that extracts the label from the parent directory name of each text file.
  • splitter=GrandparentSplitter(valid_name='test'): Splits data into train and validation sets based on the grandparent directory name, using "test" as the validation set identifier.
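The labeling and splitting conventions above can be sketched in plain Python. These helpers are simplified stand-ins, not fastai's actual implementations, shown only to illustrate how the directory layout drives both the label and the train/validation split:

```python
from pathlib import PurePosixPath

def parent_label_sketch(path):
    """Label = name of the file's parent directory (e.g. 'pos' or 'neg')."""
    return PurePosixPath(path).parent.name

def grandparent_splitter_sketch(paths, train_name="train", valid_name="test"):
    """Assign each file to train/valid based on its grandparent directory."""
    train = [p for p in paths if PurePosixPath(p).parent.parent.name == train_name]
    valid = [p for p in paths if PurePosixPath(p).parent.parent.name == valid_name]
    return train, valid

files = [
    "imdb/train/pos/0.txt",
    "imdb/train/neg/1.txt",
    "imdb/test/pos/2.txt",
]
print(parent_label_sketch(files[0]))       # pos
print(grandparent_splitter_sketch(files))  # train gets the first two, valid the third
```

This is why no explicit label file is needed: the IMDb directory structure itself encodes both the class and the split.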

The resulting DataLoaders sort documents approximately by length so that each batch groups documents of similar size, then pad each batch to the length of its longest document, handling variable-length sequences efficiently.
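A toy illustration (plain Python, not fastai internals) of why length-sorted batching matters: since each batch is padded to its longest member, grouping similar-length documents together wastes far fewer padding tokens.

```python
PAD = 0  # stand-in for the xxpad token index

def pad_batch(docs):
    """Pad every document in the batch to the longest one with PAD tokens."""
    width = max(len(d) for d in docs)
    return [d + [PAD] * (width - len(d)) for d in docs]

def pad_count(batches):
    """Total number of padding tokens across all batches."""
    return sum(row.count(PAD) for b in batches for row in b)

docs = [[1, 2, 3, 4, 5, 6], [7, 8], [9, 10, 11], [12]]
bs = 2

# Unsorted batching: a one-token doc can share a batch with a six-token doc.
unsorted = [pad_batch(docs[i:i + bs]) for i in range(0, len(docs), bs)]

# Length-sorted batching: similar-length docs share a batch, so less padding.
by_len = sorted(docs, key=len)
sorted_batches = [pad_batch(by_len[i:i + bs]) for i in range(0, len(by_len), bs)]

print(pad_count(unsorted), pad_count(sorted_batches))  # 6 4
```

fastai applies the same idea (with some shuffling to keep training stochastic), which is one reason classifier batches are cheaper than naive random batching over raw documents.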

Usage

Use this DataBlock configuration after completing language model fine-tuning. The dls_lm.vocab must be available from the language model training step. This creates the data pipeline for the final classification training stage.
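The importance of passing dls_lm.vocab can be shown with a toy sketch (plain Python with hypothetical token lists, not fastai internals): a vocabulary rebuilt from the classification corpus may order the same tokens differently, so numericalized ids would no longer line up with the pretrained encoder's embedding rows.

```python
# LM vocabulary: the pretrained encoder's embedding row i corresponds to lm_vocab[i].
lm_vocab = ["xxunk", "xxpad", "the", "movie", "great"]
lm_index = {tok: i for i, tok in enumerate(lm_vocab)}

# A vocab rebuilt from the classification corpus orders tokens differently.
new_vocab = ["xxunk", "xxpad", "great", "the", "movie"]
new_index = {tok: i for i, tok in enumerate(new_vocab)}

tokens = ["the", "movie", "great"]
print([lm_index[t] for t in tokens])   # [2, 3, 4]  -> matches embedding rows
print([new_index[t] for t in tokens])  # [3, 4, 2]  -> same words, wrong rows
```

Reusing dls_lm.vocab guarantees the mapping on the first line, so the fine-tuned encoder's embeddings are looked up correctly during classifier training.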

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/10_nlp.md (lines 656-661)
  • Library module: fastai.text.data

Signature

dls_clas = DataBlock(
    blocks=(
        TextBlock.from_folder(path, vocab=dls_lm.vocab),
        CategoryBlock
    ),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

Import

from fastai.text.all import (
    TextBlock, CategoryBlock, DataBlock,
    GrandparentSplitter, parent_label
)

I/O Contract

Inputs

Name Type Required Description
path Path Yes Root path of the dataset directory (e.g., the IMDb dataset root containing train/ and test/ subdirectories).
vocab list of str Yes Vocabulary from the language model DataLoaders. Passed as dls_lm.vocab to ensure token-to-index alignment with the pretrained encoder.
get_y callable Yes Function to extract labels from file paths. parent_label returns the parent directory name as the label string.
splitter callable Yes Splitting strategy. GrandparentSplitter(valid_name='test') uses the grandparent directory to assign files to train or validation.
bs int No Batch size. Default: 128. May need to be reduced for GPU memory constraints since variable-length documents require more memory than fixed-length LM batches.
seq_len int No Sequence length used when chunking and padding documents; the classifier backbone processes each document in chunks of this many tokens. Default: 72.

Outputs

Name Type Description
dls_clas DataLoaders A DataLoaders object with train and validation loaders. Training loader yields (text_tensor, label_tensor) pairs.
batch (x) TensorText Padded text tensor of shape (bs, max_seq_len_in_batch). Each row is a numericalized document, padded with the xxpad token index.
batch (y) TensorCategory Label tensor of shape (bs,) with integer category indices (0 for neg, 1 for pos, or vice versa).
dls_clas.vocab list of str The shared vocabulary (same as input vocab).

Usage Examples

Basic Usage

from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1: Create LM DataLoaders (needed for vocabulary)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Step 2: Create classifier DataLoaders with shared vocabulary
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Inspect the data
dls_clas.show_batch(max_n=3)

Verifying Vocabulary Alignment

from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Build LM and classifier DataLoaders
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Verify vocabularies are identical
assert dls_clas.vocab[0] == dls_lm.vocab, "Vocabularies must match!"

# Check the category vocabulary
print(dls_clas.vocab[1])
# Output: ['neg', 'pos']  (CategoryBlock sorts labels alphabetically by default)

Inspecting Batch Structure

from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Assume dls_lm is already created
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Get a batch
x, y = dls_clas.one_batch()
print(f"Text batch shape: {x.shape}")
# Output: torch.Size([128, variable_length])

print(f"Label batch shape: {y.shape}")
# Output: torch.Size([128])

print(f"Unique labels: {y.unique()}")
# Output: tensor([0, 1])

# Decode a sample (decode once, then index into the result)
decoded = dls_clas.decode((x, y))
print(f"Text: {decoded[0][0][:100]}...")
print(f"Label: {decoded[1][0]}")

Related Pages

Implements Principle

Requires Environment
