
Principle:Fastai Fastbook Classifier Data Preparation

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Text Classification, Data Engineering
Last Updated 2026-02-09 17:00 GMT

Overview

Classifier data preparation is the process of transforming labeled text documents into batched, numericalized sequences paired with category labels, while ensuring vocabulary consistency with the previously fine-tuned language model.
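As a minimal illustration of this transformation (the vocabulary and documents below are hypothetical toy data, not the IMDb corpus):

```python
# Minimal sketch: numericalize labeled documents with a fixed vocabulary,
# pairing each numericalized sequence with its category label.
vocab = ["xxunk", "xxpad", "the", "movie", "was", "great", "awful"]
tok2idx = {tok: i for i, tok in enumerate(vocab)}

def numericalize(tokens, tok2idx):
    # Unknown tokens map to index 0 ("xxunk"), following fastai's convention.
    return [tok2idx.get(t, 0) for t in tokens]

docs = [(["the", "movie", "was", "great"], "pos"),
        (["the", "movie", "was", "awful"], "neg")]
samples = [(numericalize(toks, tok2idx), label) for toks, label in docs]
print(samples[0])  # ([2, 3, 4, 5], 'pos')
```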

Description

The transition from language modeling to classification introduces several critical changes in how data is prepared:

  1. Vocabulary sharing: The classifier must use the exact same vocabulary as the fine-tuned language model. If a different vocabulary were used, the token-to-index mapping would be inconsistent, and the pretrained encoder weights would be meaningless. This is the most important constraint in the entire pipeline.
  2. Label assignment: Unlike language modeling where targets are the next token, classification requires an external label for each document. In the IMDb dataset, labels are inferred from the directory structure: documents in pos/ folders get the "pos" label, and documents in neg/ folders get the "neg" label.
  3. Train/test splitting: The classification split follows the dataset's intended structure: documents under train/ are training data, documents under test/ are validation data. This is handled by a GrandparentSplitter that examines the grandparent directory name.
  4. Variable-length handling: Unlike language model data (which is one contiguous stream), classification data consists of variable-length documents that must be batched together. The data loader sorts documents by length and pads shorter documents within each batch to minimize wasted computation.
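In fastai, these four requirements map onto a single DataBlock definition. The sketch below follows the fastbook pattern and assumes `path` points at the downloaded IMDb dataset and `dls_lm` is the DataLoaders of the previously fine-tuned language model whose vocabulary is being reused; it requires fastai and the dataset to run.

```python
from functools import partial
from fastai.text.all import (DataBlock, TextBlock, CategoryBlock, parent_label,
                             get_text_files, GrandparentSplitter)

dls_clas = DataBlock(
    # 1. Reuse the LM vocabulary so the token-to-index mapping is identical.
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    # 2. Infer the label from the parent directory name (pos/ or neg/).
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    # 3. Split train/validation by the grandparent directory name.
    splitter=GrandparentSplitter(valid_name='test'),
    # 4. dataloaders handles length-sorted batching and padding internally.
).dataloaders(path, path=path, bs=128, seq_len=72)
```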

Usage

Use classifier data preparation when:

  • Transitioning from the language model fine-tuning stage to the classification stage of the ULMFiT pipeline.
  • You have a vocabulary from a previously trained language model that must be reused.
  • Your dataset follows a directory structure where labels can be inferred from folder names.
  • You need efficient batching of variable-length text documents.

Theoretical Basis

Vocabulary Consistency Requirement

The vocabulary sharing constraint is fundamental to transfer learning in NLP:

Given:
  LM vocabulary V_lm = {t_0: "xxunk", t_1: "xxpad", ..., t_k: "the", ...}
  LM encoder weights W_lm trained with V_lm

The classifier MUST use:
  Classifier vocabulary V_cls = V_lm  (identical mapping)

If V_cls != V_lm:
  The embedding for token "the" at index k in W_lm would be
  applied to a DIFFERENT token at index k in V_cls.
  This destroys all pretrained knowledge.
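The failure mode can be demonstrated with a toy example (the vocabularies and two-dimensional embedding values are hypothetical):

```python
# The same embedding matrix read through two different vocabularies
# assigns the wrong vector to a token.
v_lm  = ["xxunk", "xxpad", "the", "movie"]   # vocab the encoder was trained with
v_cls = ["xxunk", "xxpad", "movie", "the"]   # same tokens, different order
W_lm  = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.2], [0.3, 0.8]]  # one row per v_lm index

def embed(token, vocab, weights):
    # Look up the embedding row at the token's index in the given vocabulary.
    return weights[vocab.index(token)]

print(embed("the", v_lm, W_lm))   # [0.9, 0.2] -- the row trained for "the"
print(embed("the", v_cls, W_lm))  # [0.3, 0.8] -- actually the row for "movie"
```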

Batching Strategy for Classification

Classification batching differs from language model batching because each document is an independent sample with a fixed label:

def prepare_classification_batches(documents, labels, bs, pad_idx):
    # Step 1: Sort documents by length (descending) so that documents
    # in the same batch have similar lengths.
    pairs = sorted(zip(documents, labels), key=lambda p: len(p[0]), reverse=True)

    # Step 2: Group into batches of size bs (the last batch may be smaller).
    batches = [pairs[i:i + bs] for i in range(0, len(pairs), bs)]

    # Step 3: Within each batch, pad every document to the length of the
    # longest one, using the index of the xxpad token.
    for batch in batches:
        max_len = max(len(doc) for doc, _ in batch)
        padded_docs = [doc + [pad_idx] * (max_len - len(doc)) for doc, _ in batch]
        batch_labels = [label for _, label in batch]
        yield padded_docs, batch_labels   # shapes: (bs, max_len) and (bs,)

Why sort by length? Sorting documents by length ensures that documents within the same batch have similar lengths. This minimizes the amount of padding needed, reducing wasted computation. Without sorting, a batch might contain both a 20-token review and a 2,000-token review, requiring 1,980 padding tokens for the short review.
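The savings can be quantified on toy document lengths (the numbers below are hypothetical, chosen to match the 20-token/2,000-token example above):

```python
# Padding cost = total number of pad tokens added across all batches.
def padding_cost(lengths, bs):
    total = 0
    for i in range(0, len(lengths), bs):
        batch = lengths[i:i + bs]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [20, 2000, 30, 1800]
print(padding_cost(lengths, bs=2))                        # unsorted: 1980 + 1770 = 3750
print(padding_cost(sorted(lengths, reverse=True), bs=2))  # sorted:   200 + 10   = 210
```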

Label Inference from Directory Structure

The IMDb dataset uses a naming convention where the parent directory encodes the label:

path/train/pos/review_001.txt  -> label = "pos"
path/train/neg/review_002.txt  -> label = "neg"
path/test/pos/review_003.txt   -> label = "pos"
path/test/neg/review_004.txt   -> label = "neg"

parent_label(file_path) = file_path.parent.name
    # For path/train/pos/review_001.txt: returns "pos"

GrandparentSplitter(valid_name='test'):
    # Assigns files whose grandparent directory is "train" to the training set
    # and files whose grandparent directory is "test" to the validation set.
    # For path/train/pos/review_001.txt the grandparent is "train".
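Both helpers can be mimicked with the standard library's pathlib (a sketch of the rules, not the fastai implementation):

```python
from pathlib import PurePosixPath

def parent_label(p):
    # The label is the name of the file's immediate parent directory.
    return PurePosixPath(p).parent.name

def is_valid(p, valid_name="test"):
    # GrandparentSplitter-style rule: the grandparent directory decides the split.
    return PurePosixPath(p).parent.parent.name == valid_name

print(parent_label("path/train/pos/review_001.txt"))  # pos
print(is_valid("path/test/neg/review_004.txt"))       # True
print(is_valid("path/train/pos/review_001.txt"))      # False
```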

Differences from Language Model Data

Aspect              | Language Model Data                               | Classifier Data
--------------------|---------------------------------------------------|----------------------------------------
Target              | Next token (shifted input)                        | Category label (pos/neg)
Sample independence | Samples are contiguous; hidden state carries over | Each sample is independent
Batching            | Reshaped stream with fixed seq_len                | Variable-length docs, sorted and padded
Vocabulary          | Built from corpus                                 | Inherited from LM (must match exactly)
Data used           | All text (labeled + unlabeled)                    | Only labeled text
Splitting           | Random 90/10 split                                | Train/test from directory structure
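The target difference in the first row can be made concrete with toy token IDs (hypothetical values):

```python
stream = [5, 9, 2, 7, 4]

# Language model: the target is the input shifted by one token.
lm_x, lm_y = stream[:-1], stream[1:]

# Classifier: the input is the whole document; the target is a single label.
clf_x, clf_y = stream, "pos"

print(lm_x, lm_y)    # [5, 9, 2, 7] [9, 2, 7, 4]
print(clf_x, clf_y)  # [5, 9, 2, 7, 4] pos
```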

Related Pages

Implemented By
