Implementation: Fastai Fastbook Text Classifier DataLoaders
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Text Classification |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for creating classification-ready DataLoaders that pair tokenized text documents with category labels while preserving vocabulary alignment with a pretrained language model, provided by the fastai library.
Description
The classifier DataLoaders are built using the fastai DataBlock API with specific configuration for text classification:
- TextBlock.from_folder(path, vocab=dls_lm.vocab): Creates a text processing block that tokenizes and numericalizes text files, but critically uses the vocabulary from the language model DataLoaders rather than building a new one. The vocab parameter is the key mechanism for ensuring vocabulary consistency.
- CategoryBlock: Creates a categorical target block that maps label strings ("pos", "neg") to integer indices.
- get_y=parent_label: A function that extracts the label from the parent directory name of each text file.
- splitter=GrandparentSplitter(valid_name='test'): Splits data into train and validation sets based on the grandparent directory name, using "test" as the validation set identifier.
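The labeling and splitting callables above amount to simple path inspection. As a minimal sketch (plain-Python stand-ins, not the fastai implementations), `parent_label` returns the name of a file's parent directory, and `GrandparentSplitter` assigns items to train or validation by their grandparent directory:

```python
from pathlib import Path

def parent_label_sketch(path):
    # Label = name of the file's immediate parent directory.
    return Path(path).parent.name

def grandparent_splitter_sketch(items, train_name="train", valid_name="test"):
    # Split by the grandparent directory: .../train/pos/0.txt -> train set.
    train = [i for i, p in enumerate(items) if Path(p).parent.parent.name == train_name]
    valid = [i for i, p in enumerate(items) if Path(p).parent.parent.name == valid_name]
    return train, valid

files = ["imdb/train/pos/0.txt", "imdb/train/neg/1.txt", "imdb/test/pos/2.txt"]
print([parent_label_sketch(f) for f in files])  # ['pos', 'neg', 'pos']
print(grandparent_splitter_sketch(files))       # ([0, 1], [2])
```

The real fastai functions behave the same way on IMDb-style layouts, which is why no explicit label file is needed.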
The resulting DataLoaders sort documents by length so that each batch groups documents of similar size, then pad shorter documents up to the longest one in the batch, handling variable-length sequences efficiently.
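The sort-and-pad behavior can be illustrated with a short sketch (a hypothetical helper, not fastai's `SortedDL`): sort documents by token count so each batch groups similar lengths, then pad shorter documents to the batch maximum with the pad index (fastai left-pads classification batches by default):

```python
def sorted_padded_batches(docs, bs=2, pad_idx=1):
    # Sort by length so each batch groups similarly sized documents,
    # minimizing wasted padding (the idea behind fastai's SortedDL).
    docs = sorted(docs, key=len, reverse=True)
    batches = []
    for i in range(0, len(docs), bs):
        chunk = docs[i:i + bs]
        width = max(len(d) for d in chunk)
        # Left-pad each document to the longest one in this batch.
        batches.append([[pad_idx] * (width - len(d)) + d for d in chunk])
    return batches

docs = [[5, 6], [2, 3, 4, 7], [9], [8, 3, 1]]
for b in sorted_padded_batches(docs):
    print(b)
# [[2, 3, 4, 7], [1, 8, 3, 1]]
# [[5, 6], [1, 9]]
```

Because padding only extends to the longest document *in the batch*, grouping by length keeps the padded tensors much smaller than padding every document to the global maximum.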
Usage
Use this DataBlock configuration after completing language model fine-tuning. The dls_lm.vocab must be available from the language model training step. This creates the data pipeline for the final classification training stage.
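Why the shared vocabulary matters can be shown with a tiny sketch (hypothetical token lists, not fastai code): if the classifier built its own vocabulary, the same token could map to a different index than the one the pretrained encoder's embedding matrix expects:

```python
# Two vocabularies over the same tokens but built in a different order
# (e.g., from different corpora or frequency counts).
lm_vocab  = ["xxunk", "xxpad", "movie", "great", "boring"]
clf_vocab = ["xxunk", "xxpad", "boring", "movie", "great"]

lm_index  = {t: i for i, t in enumerate(lm_vocab)}
clf_index = {t: i for i, t in enumerate(clf_vocab)}

# "great" -> 3 under the LM vocab but 4 under an independent classifier vocab,
# so the encoder would look up embedding row 4 ("boring") for "great".
print(lm_index["great"], clf_index["great"])  # 3 4
```

Passing `vocab=dls_lm.vocab` forces the classifier to reuse the LM's token-to-index mapping, so every embedding row keeps its meaning.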
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 656-661)
- Library module: fastai.text.data
Signature
```python
dls_clas = DataBlock(
    blocks=(
        TextBlock.from_folder(path, vocab=dls_lm.vocab),
        CategoryBlock
    ),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
```
Import
```python
from fastai.text.all import (
    TextBlock, CategoryBlock, DataBlock,
    GrandparentSplitter, parent_label
)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | Path | Yes | Root path of the dataset directory (e.g., the IMDb dataset root containing train/ and test/ subdirectories). |
| vocab | list of str | Yes | Vocabulary from the language model DataLoaders. Passed as dls_lm.vocab to ensure token-to-index alignment with the pretrained encoder. |
| get_y | callable | Yes | Function to extract labels from file paths. parent_label returns the parent directory name as the label string. |
| splitter | callable | Yes | Splitting strategy. GrandparentSplitter(valid_name='test') uses the grandparent directory to assign files to train or validation. |
| bs | int | No | Batch size. Default: 128. May need to be reduced for GPU memory constraints since variable-length documents require more memory than fixed-length LM batches. |
| seq_len | int | No | Sequence length used when chunking documents; the classifier encoder processes long texts seq_len tokens at a time. Default: 72. |
Outputs
| Name | Type | Description |
|---|---|---|
| dls_clas | DataLoaders | A DataLoaders object with train and validation loaders. Training loader yields (text_tensor, label_tensor) pairs. |
| batch (x) | TensorText | Padded text tensor of shape (bs, max_seq_len_in_batch). Each row is a numericalized document, padded with the xxpad token index. |
| batch (y) | TensorCategory | Label tensor of shape (bs,) with integer category indices (0 for neg, 1 for pos, or vice versa). |
| dls_clas.vocab | list | A list holding the shared text vocabulary at index 0 (identical to the input vocab) and the category vocabulary at index 1. |
Usage Examples
Basic Usage
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1: Create LM DataLoaders (needed for the vocabulary)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Step 2: Create classifier DataLoaders with the shared vocabulary
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Inspect the data
dls_clas.show_batch(max_n=3)
```
Verifying Vocabulary Alignment
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Build LM and classifier DataLoaders
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_text_files,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Verify the vocabularies are identical
assert dls_clas.vocab[0] == dls_lm.vocab, "Vocabularies must match!"

# Check the category vocabulary
print(dls_clas.vocab[1])
# Output: ['neg', 'pos'] (or ['pos', 'neg'] depending on sort order)
```
Inspecting Batch Structure
```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Assume dls_lm is already created (see Basic Usage above)
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Get a batch
x, y = dls_clas.one_batch()
print(f"Text batch shape: {x.shape}")
# Output: torch.Size([128, N]), where N is the length of the longest document in the batch
print(f"Label batch shape: {y.shape}")
# Output: torch.Size([128])
print(f"Unique labels: {y.unique()}")
# Output: tensor([0, 1])

# Decode a sample
x_dec, y_dec = dls_clas.decode((x, y))
print(f"Text: {x_dec[0][:100]}...")
print(f"Label: {y_dec[0]}")
```