
Principle:Fastai Fastbook Language Model Fine Tuning

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Transfer Learning, Language Modeling
Last Updated 2026-02-09 17:00 GMT

Overview

Language model fine-tuning is the process of adapting a general-purpose pretrained language model to the specific vocabulary, style, and domain of the target corpus before using its learned representations for downstream tasks.

Description

Language model fine-tuning is the second stage of the ULMFiT three-stage transfer learning approach (Howard & Ruder, 2018):

  1. Stage 1 - General-domain LM pretraining: A language model is pretrained on a large general corpus (e.g., Wikitext-103, containing 103 million tokens from Wikipedia). This gives the model broad knowledge of language structure, grammar, and general world knowledge. This stage is done once and the weights are distributed as a pretrained model.
  2. Stage 2 - Target-domain LM fine-tuning: The pretrained language model is fine-tuned on the target domain corpus (e.g., IMDb movie reviews). This adapts the model to the specific vocabulary, writing style, and topical distribution of the target task. This is what this principle covers.
  3. Stage 3 - Classifier training: A classification head is added on top of the fine-tuned language model encoder, and the full model is trained on the labeled classification task.

The key insight of ULMFiT is that Stage 2 is critical for achieving good downstream performance. Without domain-specific fine-tuning, the pretrained model's representations may not capture domain-specific patterns (e.g., how movie reviewers express sentiment differently from Wikipedia authors).

Usage

Use language model fine-tuning when:

  • Applying transfer learning to any NLP text classification task.
  • The target domain differs substantially from the pretraining corpus (e.g., medical text, legal documents, social media).
  • You want to leverage the full ULMFiT pipeline for maximum classification accuracy.
  • You have a significant amount of unlabeled text in the target domain (even if labeled data is scarce).

Theoretical Basis

Architecture: AWD-LSTM

The AWD-LSTM (ASGD Weight-Dropped LSTM) architecture (Merity et al., 2017) is the backbone of the ULMFiT language model. It consists of:

  • Embedding layer: Maps token indices to dense vectors of dimension 400.
  • 3-layer LSTM: Each layer has 1,150 hidden units. The layers use various dropout strategies for regularization.
  • Output projection: A linear layer that maps the final hidden state back to the vocabulary space, producing logits for next-token prediction.
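To make these sizes concrete, here is a small parameter-count sketch. It counts one bias vector per gate (some frameworks, such as PyTorch, use two), and the 60,000-token vocabulary and the 400-unit final LSTM layer (for weight tying with the embedding) are assumptions for illustration, not figures from this page:

```python
def lstm_params(input_size, hidden_size):
    """Parameter count of one LSTM layer: four gates, each with an
    input-to-hidden matrix, a hidden-to-hidden matrix, and one bias
    vector (convention: a single bias per gate)."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

vocab, emb, hid = 60_000, 400, 1150  # illustrative sizes (vocab is an assumption)
embedding = vocab * emb              # embedding matrix, 60,000 x 400
layer1 = lstm_params(emb, hid)       # 400 -> 1,150
layer2 = lstm_params(hid, hid)       # 1,150 -> 1,150
layer3 = lstm_params(hid, emb)       # 1,150 -> 400, so weights can be tied
```

The embedding matrix alone contributes 24 million parameters, which is why tying it with the output projection is a significant saving.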

The AWD-LSTM applies five distinct types of dropout for regularization:

| Dropout Type | Where Applied | Purpose |
|---|---|---|
| Embedding dropout | Input embeddings | Zeros out entire word embeddings randomly during training |
| Input dropout | Between embedding and first LSTM layer | Standard dropout on the input to the LSTM |
| Weight dropout | Recurrent weight matrices (hidden-to-hidden) | Prevents co-adaptation of recurrent connections |
| Hidden dropout | Between LSTM layers | Standard dropout between stacked LSTM layers |
| Output dropout | After the final LSTM layer | Regularizes the final representation before projection |

The drop_mult parameter scales all five dropout rates simultaneously. A value of drop_mult=0.3 reduces all default dropout rates to 30% of their base values.
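A minimal sketch of what this scaling does. The base rates below mirror fastai's AWD-LSTM language-model defaults as I understand them, but treat the exact numbers as assumptions and check your fastai version's config:

```python
# Illustrative base dropout rates for the five AWD-LSTM dropout types
# (assumed values, roughly matching fastai's awd_lstm_lm_config).
base_dropouts = {
    "embedding": 0.02,  # embed_p
    "input":     0.25,  # input_p
    "weight":    0.20,  # weight_p
    "hidden":    0.15,  # hidden_p
    "output":    0.10,  # output_p
}

def scale_dropouts(base, drop_mult):
    """Scale every dropout rate by the same multiplier, as drop_mult does."""
    return {name: rate * drop_mult for name, rate in base.items()}

scaled = scale_dropouts(base_dropouts, 0.3)  # e.g. input_p: 0.25 -> 0.075
```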

Fine-tuning Techniques

ULMFiT introduces two key techniques for stable fine-tuning:

Discriminative Learning Rates

Different layers of the model are assigned different learning rates. Earlier layers (which capture general features) receive lower learning rates, while later layers (which capture task-specific features) receive higher learning rates:

lr_layer_n = lr_base
lr_layer_(n-1) = lr_base / 2.6
lr_layer_(n-2) = lr_base / 2.6^2
...

This prevents catastrophic forgetting of the general knowledge learned during pretraining.
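The geometric decay above can be sketched directly; the helper name and the layer count are illustrative, but the 2.6 factor is the one recommended by ULMFiT:

```python
def discriminative_lrs(lr_base, n_layers, factor=2.6):
    """Per-layer learning rates: the last layer gets lr_base, and each
    earlier layer's rate is divided by a further factor of 2.6."""
    # Index 0 = earliest layer, index n_layers - 1 = last layer.
    return [lr_base / factor ** (n_layers - 1 - i) for i in range(n_layers)]

lrs = discriminative_lrs(lr_base=1e-2, n_layers=4)
# Earliest layers get the smallest rates, preserving general features.
```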

Slanted Triangular Learning Rates

The learning rate follows a schedule that:

  1. Quickly increases linearly from a very small value to the peak (lr_max) during the first ~10% of training.
  2. Slowly decreases linearly back to near zero over the remaining ~90%.

In fastai this is applied via the fit_one_cycle method, which uses a closely related one-cycle schedule (recent fastai versions anneal with a cosine curve rather than strictly linearly).

Metrics

  • Accuracy: The fraction of next-token predictions that match the actual next token. A well-fine-tuned language model on IMDb typically achieves ~35-40% accuracy (which is high for next-token prediction over a 60,000-token vocabulary).
  • Perplexity: Defined as exp(cross_entropy_loss). Lower perplexity indicates better language modeling. A perplexity of 20 means the model is, on average, as uncertain as if choosing uniformly among 20 tokens.
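The perplexity definition above is a one-line computation; for example, a cross-entropy loss of ln(20) ≈ 3.0 nats corresponds to a perplexity of exactly 20:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the mean per-token
    cross-entropy loss (measured in nats)."""
    return math.exp(cross_entropy_loss)

# A model with zero loss is perfectly certain: perplexity 1.
# A loss of math.log(20) means it is as uncertain, on average,
# as a uniform choice among 20 tokens.
```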

Encoder Saving

After fine-tuning, only the encoder portion (embedding + LSTM layers) is saved, not the output projection layer. The encoder captures the learned representations and will be loaded into the classifier in Stage 3. The output projection is discarded because the classifier uses a different head architecture.

FUNCTION fine_tune_language_model(pretrained_model, domain_data, epochs):
    model = load_pretrained(pretrained_model)

    # Fine-tune with discriminative LRs and a slanted triangular schedule
    FOR epoch IN 1..epochs:
        FOR batch IN domain_data:
            x, y = batch                       # y is x shifted by one token
            predictions = model(x)
            loss = cross_entropy(predictions, y)
            lr = schedule(current_step)        # slanted triangular LR
            update_weights(loss, discriminative_lrs(lr))

    # Save only the encoder (discard the output projection)
    encoder = model.layers[:-1]  # everything except the final linear layer
    save(encoder, "fine_tuned_encoder")

    RETURN encoder

Related Pages

Implemented By

Uses Heuristic
