# Heuristic: Fastai Fastbook Dropout Regularization
| Knowledge Sources | |
|---|---|
| Domains | Regularization, NLP |
| Last Updated | 2026-02-09 17:00 GMT |
## Overview
Use `drop_mult` to scale dropout rates: `drop_mult=0.3` for language models, `drop_mult=0.5` for classifiers to prevent overfitting.
## Description
Dropout is a regularization technique that randomly zeroes activations during training with probability `p`, then rescales remaining activations by `1/(1-p)` to maintain expected values. In fastai, the `drop_mult` parameter is a global multiplier that scales all dropout rates in a model simultaneously, making it easy to increase or decrease the overall regularization level. Lower `drop_mult` values mean less dropout (less regularization), while higher values increase dropout.
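The `1/(1-p)` rescaling can be sanity-checked with a minimal, framework-free sketch (plain Python over a list, for illustration only; fastai/PyTorch implement this on tensors):

```python
import random

def dropout(xs, p, training=True):
    """Inverted dropout: zero each value with probability p,
    then scale survivors by 1/(1-p) so the expected value is unchanged."""
    if not training:
        return list(xs)
    return [0.0 if random.random() < p else x / (1 - p) for x in xs]

random.seed(0)
acts = [1.0] * 100_000
mean = sum(dropout(acts, p=0.5)) / len(acts)
print(round(mean, 2))  # close to 1.0: the expectation is preserved
```

At inference time (`training=False`) the activations pass through untouched, which is why the rescaling must happen during training.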
The AWD-LSTM architecture used for NLP in Fastbook applies dropout at multiple points: embedding dropout, input dropout, weight dropout, and hidden dropout. The `drop_mult` parameter scales all of these together.
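Conceptually, `drop_mult` just multiplies each of those per-site base rates by one scalar. A toy sketch of that scaling (the base rates below are illustrative placeholders, not fastai's actual AWD-LSTM defaults):

```python
# Illustrative base dropout rates -- placeholders, not fastai's real defaults.
base_rates = {"embed_p": 0.1, "input_p": 0.6, "weight_p": 0.5, "hidden_p": 0.2}

def scaled_rates(base, drop_mult):
    """Scale every dropout probability in the model by one multiplier."""
    return {name: round(p * drop_mult, 3) for name, p in base.items()}

print(scaled_rates(base_rates, 0.3))  # lighter regularization, e.g. LM fine-tuning
print(scaled_rates(base_rates, 0.5))  # moderate regularization, e.g. classification
```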
## Usage
Use this heuristic when:
- Training language models: Set `drop_mult=0.3` for domain-specific LM fine-tuning
- Training text classifiers: Use `drop_mult=0.5` (default) for classifier training
- Overfitting observed: Increase `drop_mult` if validation loss rises while training loss falls
- Underfitting observed: Decrease `drop_mult` if the model struggles to learn
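The last two bullets can be condensed into a toy adjustment rule. This is a hypothetical helper for illustration, not part of fastai's API:

```python
def suggest_drop_mult(train_loss_falling, valid_loss_rising, current=0.5, step=0.1):
    """Toy heuristic: nudge drop_mult up on overfitting signals,
    down on underfitting signals (illustrative, not a fastai function)."""
    if train_loss_falling and valid_loss_rising:   # overfitting
        return round(min(current + step, 1.0), 2)
    if not train_loss_falling:                     # model struggling to learn
        return round(max(current - step, 0.1), 2)
    return current                                 # healthy training: leave it

print(suggest_drop_mult(True, True))    # overfitting -> raise to 0.6
print(suggest_drop_mult(False, False))  # underfitting -> lower to 0.4
```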
## The Insight (Rule of Thumb)
- Action: Set `drop_mult=` parameter when creating a learner with `language_model_learner` or `text_classifier_learner`.
- Value:
  - Language model fine-tuning: `drop_mult=0.3` (lower dropout since we want the model to memorize domain patterns)
  - Text classification: `drop_mult=0.5` (default, moderate regularization)
- Trade-off: Too much dropout slows convergence and can prevent learning. Too little dropout allows overfitting. The `drop_mult` multiplier provides a single knob to tune.
## Reasoning
Geoffrey Hinton has described the inspiration for dropout with a banking analogy: bank employees rotate frequently, so no fixed group can conspire to commit fraud; similarly, randomly dropping neurons prevents co-adaptation, where neurons come to rely too heavily on one another. The Merity et al. AWD-LSTM paper demonstrated that effective use of dropout at multiple points (embedding, input, weight, hidden) allows a simple LSTM to achieve state-of-the-art results that previously required more complex architectures.
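Weight dropout (the "WD" in AWD-LSTM) is worth singling out: it zeroes individual recurrent *weights* rather than activations. A simplified, framework-free sketch (fastai's real version applies dropout to the LSTM's hidden-to-hidden weight tensor):

```python
import random

def weight_dropout(weight_matrix, p):
    """DropConnect-style weight dropout: zero each weight with probability p,
    rescaling survivors by 1/(1-p). Simplified sketch, not fastai's code."""
    return [[0.0 if random.random() < p else w / (1 - p) for w in row]
            for row in weight_matrix]

random.seed(1)
w = [[0.5, -0.2], [0.1, 0.9]]
print(weight_dropout(w, p=0.5))
```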
For language model fine-tuning, lower dropout (`drop_mult=0.3`) is appropriate because the model needs to adapt to domain-specific vocabulary and patterns. For classification, moderate dropout (`drop_mult=0.5`) helps prevent overfitting to the training labels.
## Code Evidence
Language model with `drop_mult=0.3` from `10_nlp.md:551-553`:
```python
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
```
Dropout implementation from `12_nlp_dive.md:750-756`:
```python
class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x  # dropout is only active during training
        # Keep each activation with probability 1-p, rescale survivors by 1/(1-p).
        # Note: the book's version references a bare `p` here; it must be `self.p`.
        mask = x.new(*x.shape).bernoulli_(1-self.p)
        return x * mask.div_(1-self.p)
```