Principle: Fastai Fastbook Classifier Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Text Classification, Data Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Classifier data preparation is the process of transforming labeled text documents into batched, numericalized sequences paired with category labels, while ensuring vocabulary consistency with the previously fine-tuned language model.
Description
The transition from language modeling to classification introduces several critical changes in how data is prepared:
- Vocabulary sharing: The classifier must use the exact same vocabulary as the fine-tuned language model. If a different vocabulary were used, the token-to-index mapping would be inconsistent, and the pretrained encoder weights would be meaningless. This is the most important constraint in the entire pipeline.
- Label assignment: Unlike language modeling where targets are the next token, classification requires an external label for each document. In the IMDb dataset, labels are inferred from the directory structure: documents in pos/ folders get the "pos" label, and documents in neg/ folders get the "neg" label.
- Train/test splitting: The classification split follows the dataset's intended structure: documents under train/ are training data, documents under test/ are validation data. This is handled by a GrandparentSplitter that examines the grandparent directory name.
- Variable-length handling: Unlike language model data (which is one contiguous stream), classification data consists of variable-length documents that must be batched together. The data loader sorts documents by length and pads shorter documents within each batch to minimize wasted computation.
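The vocabulary-sharing point above can be sketched in plain Python. This is an illustrative toy (the token list and `numericalize` helper are made up for the example, not fastai's API); it shows how a shared token-to-index mapping makes documents compatible with the pretrained encoder's embedding rows.

```python
# Toy vocabulary following fastai's convention: index 0 = xxunk, index 1 = xxpad.
lm_vocab = ["xxunk", "xxpad", "the", "movie", "was", "great", "bad"]
lm_index = {tok: i for i, tok in enumerate(lm_vocab)}

def numericalize(tokens, index, unk=0):
    """Map tokens to indices, falling back to xxunk for unknown tokens."""
    return [index.get(t, unk) for t in tokens]

# The classifier must numericalize with the SAME mapping the LM was trained on,
# so that embedding row i still corresponds to lm_vocab[i].
ids = numericalize(["the", "movie", "was", "great"], lm_index)  # [2, 3, 4, 5]
```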
Usage
Use classifier data preparation when:
- Transitioning from the language model fine-tuning stage to the classification stage of the ULMFiT pipeline.
- You have a vocabulary from a previously trained language model that must be reused.
- Your dataset follows a directory structure where labels can be inferred from folder names.
- You need efficient batching of variable-length text documents.
Theoretical Basis
Vocabulary Consistency Requirement
The vocabulary sharing constraint is fundamental to transfer learning in NLP:
Given:
    LM vocabulary V_lm = {t_0: "xxunk", t_1: "xxpad", ..., t_k: "the", ...}
    LM encoder weights W_lm trained with V_lm

The classifier MUST use:
    Classifier vocabulary V_cls = V_lm (identical mapping)

If V_cls != V_lm:
    The embedding for token "the" at index k in W_lm would be
    applied to a DIFFERENT token at index k in V_cls.
    This destroys all pretrained knowledge.
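The failure mode can be demonstrated concretely with two made-up vocabularies that contain the same tokens in a different order (the vocabularies and string "vectors" below are purely illustrative):

```python
lm_vocab  = ["xxunk", "xxpad", "the", "film"]   # vocab the encoder was trained with
cls_vocab = ["xxunk", "xxpad", "film", "the"]   # same tokens, different order

# Stand-in for the pretrained embedding matrix: row i encodes lm_vocab[i].
embeddings = {i: f"vector-for-{tok}" for i, tok in enumerate(lm_vocab)}

# Numericalizing "the" with the WRONG vocabulary gives index 3,
# which looks up the embedding that was trained for "film".
idx_in_cls = cls_vocab.index("the")      # 3
looked_up  = embeddings[idx_in_cls]      # "vector-for-film" -- wrong token!
```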
Batching Strategy for Classification
Classification batching differs from language model batching because each document is an independent sample with a fixed label:
FUNCTION prepare_classification_batches(documents, labels, bs):
    # Step 1: Sort documents by length (descending)
    sorted_pairs = sort_by_length(zip(documents, labels), key=doc_length)

    # Step 2: Group into batches of size bs
    batches = chunk(sorted_pairs, bs)

    # Step 3: For each batch, pad to the length of the longest document
    FOR EACH batch IN batches:
        max_len = max(len(doc) FOR (doc, label) IN batch)
        padded_docs = []
        batch_labels = []
        FOR EACH (doc, label) IN batch:
            padded = pad_to_length(doc, max_len, pad_token=xxpad_index)
            padded_docs.append(padded)
            batch_labels.append(label)
        x = tensor(padded_docs)   # shape: (bs, max_len)
        y = tensor(batch_labels)  # shape: (bs,)
        YIELD (x, y)
Why sort by length? Sorting documents by length ensures that documents within the same batch have similar lengths. This minimizes the amount of padding needed, reducing wasted computation. Without sorting, a batch might contain both a 20-token review and a 2,000-token review, requiring 1,980 padding tokens for the short review.
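The pseudocode above translates directly into a runnable sketch. This is a plain-Python illustration, not fastai's `SortedDL` implementation; the pad index 1 for `xxpad` follows the vocabulary convention shown earlier.

```python
XXPAD = 1  # index of the xxpad token in the shared vocabulary

def prepare_classification_batches(documents, labels, bs):
    """Yield (padded_docs, labels) batches, longest documents first."""
    # Sort by document length, descending, so each batch needs minimal padding.
    pairs = sorted(zip(documents, labels), key=lambda p: len(p[0]), reverse=True)
    for i in range(0, len(pairs), bs):
        batch = pairs[i:i + bs]
        max_len = max(len(doc) for doc, _ in batch)
        x = [doc + [XXPAD] * (max_len - len(doc)) for doc, _ in batch]
        y = [label for _, label in batch]
        yield x, y  # x: (bs, max_len) nested list, y: (bs,)

docs = [[5, 6], [2, 3, 4, 5], [7], [2, 9, 9]]
batches = list(prepare_classification_batches(docs, ["pos", "neg", "pos", "neg"], bs=2))
# First batch holds the two longest documents, padded to length 4.
```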
Label Inference from Directory Structure
The IMDb dataset uses a naming convention where the parent directory encodes the label:
path/train/pos/review_001.txt -> label = "pos"
path/train/neg/review_002.txt -> label = "neg"
path/test/pos/review_003.txt -> label = "pos"
path/test/neg/review_004.txt -> label = "neg"
parent_label(file_path) = file_path.parent.name
# For path/train/pos/review_001.txt: returns "pos"
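The labelling rule can be sketched with `pathlib` (the file paths are illustrative, mirroring the layout above):

```python
from pathlib import PurePosixPath

def parent_label(file_path):
    """Return the name of the file's immediate parent directory as its label."""
    return PurePosixPath(file_path).parent.name

label = parent_label("path/train/pos/review_001.txt")  # "pos"
```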
GrandparentSplitter(valid_name='test'):
    # Assigns files under path/train/... to the training set
    # Assigns files under path/test/... to the validation set
    # The grandparent of review_001.txt in path/train/pos/ is "train"
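A minimal splitter in the same spirit can be written in plain Python. This is a sketch that imitates the behavior described above (fastai's `GrandparentSplitter` likewise returns lists of indices), not the library's actual implementation:

```python
from pathlib import PurePosixPath

def grandparent_splitter(paths, train_name="train", valid_name="test"):
    """Split file paths into (train_idxs, valid_idxs) by grandparent folder name."""
    def grandparent(p):
        return PurePosixPath(p).parent.parent.name
    train = [i for i, p in enumerate(paths) if grandparent(p) == train_name]
    valid = [i for i, p in enumerate(paths) if grandparent(p) == valid_name]
    return train, valid

paths = [
    "path/train/pos/review_001.txt",
    "path/train/neg/review_002.txt",
    "path/test/pos/review_003.txt",
]
train_idxs, valid_idxs = grandparent_splitter(paths)  # ([0, 1], [2])
```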
Differences from Language Model Data
| Aspect | Language Model Data | Classifier Data |
|---|---|---|
| Target | Next token (shifted input) | Category label (pos/neg) |
| Sample independence | Samples are contiguous; hidden state carries over | Each sample is independent |
| Batching | Reshaped stream with fixed seq_len | Variable-length docs, sorted and padded |
| Vocabulary | Built from corpus | Inherited from LM (must match exactly) |
| Data used | All text (labeled + unlabeled) | Only labeled text |
| Splitting | Random 90/10 split | Train/test from directory structure |
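The first three rows of the table can be contrasted with a toy-scale sketch (all numbers illustrative; pad index 1 stands in for xxpad):

```python
# Language model data: one contiguous token stream, reshaped into bs rows,
# then sliced into fixed-length windows; the target is the input shifted by one.
stream = list(range(12))
bs, seq_len = 2, 3
row_len = len(stream) // bs
rows = [stream[i * row_len:(i + 1) * row_len] for i in range(bs)]
x_lm = [r[:seq_len] for r in rows]        # [[0, 1, 2], [6, 7, 8]]
y_lm = [r[1:seq_len + 1] for r in rows]   # [[1, 2, 3], [7, 8, 9]]

# Classifier data: each document is an independent sample with its own label,
# padded to the longest document in the batch.
docs, labels = [[2, 3, 4], [5, 6]], ["pos", "neg"]
x_cls = [d + [1] * (3 - len(d)) for d in docs]  # [[2, 3, 4], [5, 6, 1]]
y_cls = labels                                  # ["pos", "neg"]
```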