Workflow:Fastai Fastbook NLP Text Classification

From Leeroopedia
Revision as of 11:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Fastai_Fastbook_NLP_Text_Classification.md)


Knowledge Sources
Domains: NLP, Deep_Learning, Transfer_Learning
Last Updated: 2026-02-09 17:00 GMT

Overview

End-to-end process for text classification using the ULMFiT three-stage transfer learning approach: pretrained language model, domain-specific language model fine-tuning, and classifier training.

Description

This workflow implements the Universal Language Model Fine-Tuning (ULMFiT) method for text classification. It starts with a language model pretrained on a large corpus (Wikipedia), fine-tunes that language model on the target domain's text to learn domain-specific vocabulary and style, and then trains a classifier on top of the fine-tuned language model. The approach uses fastai's text processing pipeline for tokenization (word-level via spaCy with special tokens) and numericalization, and RNN-based architectures (AWD-LSTM) for both the language model and classifier.

Usage

Execute this workflow when you have a text dataset with categorical labels and need to build a classifier (e.g., sentiment analysis on movie reviews, spam detection, document categorization). This approach is particularly effective when labeled data is limited, as the language model fine-tuning stage can leverage unlabeled text from the target domain to improve representations before classification training.

Execution Steps

Step 1: Text Data Preparation

Load the text dataset and organize it for processing. Ensure text documents are accessible along with their labels. For datasets like IMDb, this means loading reviews from train/test/unsup directories. Identify both labeled data (for classification) and any available unlabeled data (for language model fine-tuning).

Key considerations:

  • Include unlabeled text data if available; it improves language model quality
  • Organize data so labels can be extracted (from folder structure, CSV, or dataframe columns)
  • Understand the domain's vocabulary and text style
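
As a minimal sketch of extracting labels from a folder structure, the snippet below treats the immediate parent folder as the label and sets aside anything under an unlabeled folder. The function name `split_by_folder` and the example file names are hypothetical; only the train/test/unsup layout comes from the text above.

```python
from pathlib import PurePosixPath

def split_by_folder(paths, unlabeled_dir="unsup"):
    """Return (labeled, unlabeled); labeled holds (path, label) pairs."""
    labeled, unlabeled = [], []
    for p in map(PurePosixPath, paths):
        folder = p.parent.name  # immediate parent folder, e.g. "pos" or "unsup"
        if folder == unlabeled_dir:
            unlabeled.append(str(p))          # kept for language model fine-tuning
        else:
            labeled.append((str(p), folder))  # folder name doubles as the label
    return labeled, unlabeled

# Hypothetical file names, for illustration only.
example_paths = [
    "imdb/train/pos/a.txt",
    "imdb/train/neg/b.txt",
    "imdb/test/pos/c.txt",
    "imdb/unsup/d.txt",
]
labeled, unlabeled = split_by_folder(example_paths)
```

The labeled pairs feed the classifier stage, while labeled and unlabeled texts together can feed the language model stage.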

Step 2: Tokenization

Convert raw text into sequences of tokens using fastai's tokenization pipeline. The default word tokenizer (backed by spaCy) splits text into words and punctuation while handling special cases (contractions, abbreviations, URLs). Fastai then applies additional rules that insert special tokens marking the beginning of each text (xxbos), capitalization (xxmaj), and repeated characters (xxrep); unknown words are later replaced by xxunk during numericalization.

Key considerations:

  • Word-based tokenization is the default; subword tokenization is an alternative for multilingual or specialized domains
  • Special tokens encode structural information that helps the model learn patterns
  • All text is lowercased with capitalization encoded as special tokens to reduce vocabulary size
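
To illustrate how special tokens can carry capitalization while the vocabulary stays lowercase, here is a deliberately simplified sketch. It uses a regular expression rather than spaCy, and `simple_tokenize` is a hypothetical helper, not fastai's tokenizer; it covers only the xxbos and xxmaj rules.

```python
import re

BOS, MAJ = "xxbos", "xxmaj"

def simple_tokenize(text):
    """Split into words and punctuation, lowercase, and flag capitalized words."""
    tokens = [BOS]  # mark the beginning of the text
    for piece in re.findall(r"\w+|[^\w\s]", text):
        # Capitalized word: first letter upper, rest lower (e.g. "Great", "I").
        if piece == piece.capitalize() and piece != piece.lower():
            tokens.append(MAJ)
        # All-caps words are only lowercased here; this sketch has no rule for them.
        tokens.append(piece.lower())
    return tokens

example_tokens = simple_tokenize("The movie was Great!")
# -> ['xxbos', 'xxmaj', 'the', 'movie', 'was', 'xxmaj', 'great', '!']
```

With this encoding, "The" and "the" share one vocabulary entry, and the model can still learn that a capital letter was present.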

Step 3: Numericalization

Build a vocabulary from the tokenized text and convert each token to its integer index. The vocabulary maps each unique token to a number. Tokens below a minimum frequency threshold or beyond a maximum vocabulary size are mapped to the unknown token (xxunk).

Key considerations:

  • Default maximum vocabulary size is 60,000 tokens
  • Minimum frequency threshold (3 occurrences by default) filters rare tokens that provide little signal
  • The vocabulary must be consistent between language model and classifier training
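
The vocabulary-building and lookup logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not fastai's implementation: `build_vocab` and `numericalize` are hypothetical names, and the order of the special tokens is chosen for the example.

```python
from collections import Counter

UNK = "xxunk"

def build_vocab(token_lists, min_freq=3, max_vocab=60000,
                specials=(UNK, "xxpad", "xxbos")):
    """Keep the most frequent tokens that occur at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, n in counts.most_common(max_vocab)
            if n >= min_freq and tok not in specials]
    return list(specials) + kept  # special tokens come first

def numericalize(tokens, vocab):
    """Map each token to its index; anything outside the vocab becomes xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index[UNK]) for tok in tokens]

docs = [["xxbos", "the", "movie", "the", "end"], ["xxbos", "the", "movie"]]
vocab = build_vocab(docs, min_freq=2, max_vocab=10)
ids = numericalize(["xxbos", "the", "end"], vocab)  # "end" is too rare -> xxunk
```

Reusing the same `vocab` list for the classifier is what keeps token indices aligned with the embeddings learned by the language model.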

Step 4: Language Model DataLoaders

Create DataLoaders specifically designed for language modeling. The LMDataLoader concatenates all documents into one continuous stream and creates sequences where the dependent variable is the input shifted by one token (predicting the next word). It handles shuffling by randomizing the order of concatenated documents while preserving sequence continuity within documents.

Key considerations:

  • Sequence length (bptt) controls how many tokens the model sees at once
  • Shuffling preserves intra-document token order while randomizing document order
  • Both labeled and unlabeled text are used for language model training
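
A small sketch of the stream construction may help: documents are optionally shuffled as whole units, concatenated, and cut into input/target pairs where the target is the input shifted by one token. The function `lm_sequences` is hypothetical and omits the batching detail a real loader handles.

```python
import random

def lm_sequences(docs, bptt, rng=None):
    """Build (input, target) pairs of length bptt from token-id documents."""
    order = list(docs)
    if rng is not None:
        rng.shuffle(order)  # shuffle whole documents, never tokens inside one
    stream = [tok for doc in order for tok in doc]
    # One token is held back so every input position has a next-token target.
    n_seqs = (len(stream) - 1) // bptt
    pairs = []
    for i in range(n_seqs):
        start = i * bptt
        x = stream[start:start + bptt]
        y = stream[start + 1:start + bptt + 1]  # x shifted by one token
        pairs.append((x, y))
    return pairs

pairs = lm_sequences([[2, 5, 6, 7], [2, 8, 9]], bptt=3)
# -> [([2, 5, 6], [5, 6, 7]), ([7, 2, 8], [2, 8, 9])]
```

Passing `rng=random.Random(0)` changes the document order between epochs while each document's tokens stay contiguous.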

Step 5: Language Model Fine-tuning

Load a pretrained AWD-LSTM language model and fine-tune it on the target domain corpus. The pretrained model understands general English from Wikipedia training. Fine-tuning adapts it to the specific vocabulary, style, and patterns of the target domain (e.g., movie review language for IMDb). Use fit_one_cycle with discriminative learning rates.

Key considerations:

  • Start by training only the last layer group while the rest of the model stays frozen, then gradually unfreeze deeper layers
  • Use discriminative learning rates: lower rates for earlier layers, higher rates for later layers
  • Monitor perplexity (exponential of the loss) as the quality metric
  • Even a modest perplexity improvement translates to better downstream classification
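
Two of the quantities above are simple enough to compute by hand. The sketch below shows perplexity as the exponential of the cross-entropy loss and one possible way to spread learning rates across layer groups. The geometric spacing and the names `perplexity` and `discriminative_lrs` are assumptions made for illustration, not a statement of how any particular library assigns rates.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the (natural-log) cross-entropy loss."""
    return math.exp(cross_entropy_loss)

def discriminative_lrs(lr_min, lr_max, n_groups):
    """Geometrically spaced rates: lowest for the earliest group, highest for the last."""
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

example_ppl = perplexity(4.0)                  # roughly 54.6
example_lrs = discriminative_lrs(1e-5, 1e-3, 3)  # roughly [1e-5, 1e-4, 1e-3]
```

Because perplexity grows exponentially with the loss, a drop in loss of 0.1 lowers perplexity by about 10 percent.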

Step 6: Classifier DataLoaders

Create DataLoaders for the classification task using only the labeled portion of the data. The classifier uses the same tokenization and vocabulary as the language model. Padding and sorting by text length ensure efficient batching. The DataBlock specifies TextBlock for inputs and CategoryBlock for labels.

Key considerations:

  • Use the exact same vocabulary from the language model stage
  • Sort documents by length for efficient padding within batches
  • Apply the same tokenization rules as the language model
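
The benefit of sorting by length before padding can be seen in a short sketch. The helper names, the padding index of 1, and padding on the right are all assumptions for this example; a real pipeline may pad on the other side or use a different index.

```python
def padded_batches(docs, batch_size, pad_id=1):
    """Sort token-id documents by length, then pad each batch to its longest member."""
    ordered = sorted(docs, key=len, reverse=True)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        width = len(chunk[0])  # longest document in this batch
        batches.append([doc + [pad_id] * (width - len(doc)) for doc in chunk])
    return batches

def count_padding(batches, pad_id=1):
    """Total number of padding positions across all batches."""
    return sum(row.count(pad_id) for batch in batches for row in batch)

docs = [[5, 6, 7, 8, 9], [5, 6], [5, 6, 7, 8], [5]]
batches = padded_batches(docs, batch_size=2)
# -> [[[5, 6, 7, 8, 9], [5, 6, 7, 8, 1]], [[5, 6], [5, 1]]]
```

Here sorting leaves 2 padding positions in total; batching the same documents in their original order would need 6.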

Step 7: Classifier Training

Create a text_classifier_learner that reuses the encoder (all layers except the final language model head) from the fine-tuned language model. Train the classifier using gradual unfreezing: first train only the new classification head, then progressively unfreeze and fine-tune deeper layers. Unfreezing in stages reduces the risk of catastrophic forgetting of the learned representations.

Key considerations:

  • Load the encoder weights from the saved language model before training
  • Use gradual unfreezing: train head, unfreeze one more layer group, repeat
  • Apply lower learning rates to earlier layers, which contain more general knowledge
  • Monitor accuracy on the validation set to select the best model
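
The gradual unfreezing schedule can be written down as a list of stages, each marking which layer groups are trainable. This is a sketch of the idea only: `gradual_unfreezing` is a hypothetical helper, and it assumes exactly one additional group is unfrozen per stage, with the last group being the classification head.

```python
def gradual_unfreezing(n_groups):
    """Return one list of booleans per stage; True means the group is trainable."""
    stages = []
    for n_trainable in range(1, n_groups + 1):
        first_trainable = n_groups - n_trainable  # groups before this stay frozen
        stages.append([i >= first_trainable for i in range(n_groups)])
    return stages

for stage, mask in enumerate(gradual_unfreezing(3), start=1):
    print(f"stage {stage}: trainable groups = {mask}")
# stage 1: trainable groups = [False, False, True]
# stage 2: trainable groups = [False, True, True]
# stage 3: trainable groups = [True, True, True]
```

In practice each stage would be followed by one or more training epochs before the next group is unfrozen.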
