# Heuristic: Fastai Fastbook Dropout Regularization
| Knowledge Sources | |
|---|---|
| Domains | Regularization, NLP |
| Last Updated | 2026-02-09 17:00 GMT |
## Overview
Use `drop_mult` to scale dropout rates: `drop_mult=0.3` for language models, `drop_mult=0.5` for classifiers to prevent overfitting.
## Description
Dropout is a regularization technique that randomly zeroes activations during training with probability `p`, then rescales remaining activations by `1/(1-p)` to maintain expected values. In fastai, the `drop_mult` parameter is a global multiplier that scales all dropout rates in a model simultaneously, making it easy to increase or decrease the overall regularization level. Lower `drop_mult` values mean less dropout (less regularization), while higher values increase dropout.
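The `1/(1-p)` rescaling can be sanity-checked with a minimal, framework-free sketch (plain Python over a list, for illustration only; fastai/PyTorch implement this on tensors):

```python
import random

def dropout(xs, p, training=True):
    """Inverted dropout: zero each value with probability p,
    then scale survivors by 1/(1-p) so the expected value is unchanged."""
    if not training:
        return list(xs)
    return [0.0 if random.random() < p else x / (1 - p) for x in xs]

random.seed(0)
acts = [1.0] * 100_000
mean = sum(dropout(acts, p=0.5)) / len(acts)
print(round(mean, 2))  # close to 1.0: the expectation is preserved
```

At inference time (`training=False`) the activations pass through untouched, which is why the rescaling must happen during training.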
The AWD-LSTM architecture used for NLP in Fastbook applies dropout at multiple points: embedding dropout, input dropout, weight dropout, and hidden dropout. The `drop_mult` parameter scales all of these together.
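Conceptually, `drop_mult` just multiplies each of those per-site base rates by one scalar. A toy sketch of that scaling (the base rates below are illustrative placeholders, not fastai's actual AWD-LSTM defaults):

```python
# Illustrative base dropout rates -- placeholders, not fastai's real defaults.
base_rates = {"embed_p": 0.1, "input_p": 0.6, "weight_p": 0.5, "hidden_p": 0.2}

def scaled_rates(base, drop_mult):
    """Scale every dropout probability in the model by one multiplier."""
    return {name: round(p * drop_mult, 3) for name, p in base.items()}

print(scaled_rates(base_rates, 0.3))  # lighter regularization, e.g. LM fine-tuning
print(scaled_rates(base_rates, 0.5))  # moderate regularization, e.g. classification
```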
## Usage
Use this heuristic when:
- Training language models: Set `drop_mult=0.3` for domain-specific LM fine-tuning
- Training text classifiers: Use `drop_mult=0.5` (default) for classifier training
- Overfitting observed: Increase `drop_mult` if validation loss rises while training loss falls
- Underfitting observed: Decrease `drop_mult` if the model struggles to learn
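The last two bullets can be condensed into a toy adjustment rule. This is a hypothetical helper for illustration, not part of fastai's API:

```python
def suggest_drop_mult(train_loss_falling, valid_loss_rising, current=0.5, step=0.1):
    """Toy heuristic: nudge drop_mult up on overfitting signals,
    down on underfitting signals (illustrative, not a fastai function)."""
    if train_loss_falling and valid_loss_rising:   # overfitting
        return round(min(current + step, 1.0), 2)
    if not train_loss_falling:                     # model struggling to learn
        return round(max(current - step, 0.1), 2)
    return current                                 # healthy training: leave it

print(suggest_drop_mult(True, True))    # overfitting -> raise to 0.6
print(suggest_drop_mult(False, False))  # underfitting -> lower to 0.4
```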
## The Insight (Rule of Thumb)
- Action: Set `drop_mult=` parameter when creating a learner with `language_model_learner` or `text_classifier_learner`.
- Value:
  - Language model fine-tuning: `drop_mult=0.3` (lower dropout since we want the model to memorize domain patterns)
  - Text classification: `drop_mult=0.5` (default, moderate regularization)
- Trade-off: Too much dropout slows convergence and can prevent learning. Too little dropout allows overfitting. The `drop_mult` multiplier provides a single knob to tune.
## Reasoning
Geoffrey Hinton has described the inspiration for dropout with a banking analogy: bank employees rotate frequently, so no fixed group can conspire to commit fraud; similarly, randomly dropping neurons prevents co-adaptation, where neurons come to rely too heavily on one another. The Merity et al. AWD-LSTM paper demonstrated that effective use of dropout at multiple points (embedding, input, weight, hidden) allows a simple LSTM to achieve state-of-the-art results that previously required more complex architectures.
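Weight dropout (the "WD" in AWD-LSTM) is worth singling out: it zeroes individual recurrent *weights* rather than activations. A simplified, framework-free sketch (fastai's real version applies dropout to the LSTM's hidden-to-hidden weight tensor):

```python
import random

def weight_dropout(weight_matrix, p):
    """DropConnect-style weight dropout: zero each weight with probability p,
    rescaling survivors by 1/(1-p). Simplified sketch, not fastai's code."""
    return [[0.0 if random.random() < p else w / (1 - p) for w in row]
            for row in weight_matrix]

random.seed(1)
w = [[0.5, -0.2], [0.1, 0.9]]
print(weight_dropout(w, p=0.5))
```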
For language model fine-tuning, lower dropout (`drop_mult=0.3`) is appropriate because the model needs to adapt to domain-specific vocabulary and patterns. For classification, moderate dropout (`drop_mult=0.5`) helps prevent overfitting to the training labels.
## Code Evidence
Language model with `drop_mult=0.3` from `10_nlp.md:551-553`:
```python
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
```
Dropout implementation from `12_nlp_dive.md:750-756`:
```python
class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x  # dropout is only active during training
        # Keep each activation with probability 1-p, rescale survivors by 1/(1-p).
        # Note: the book's version references a bare `p` here; it must be `self.p`.
        mask = x.new(*x.shape).bernoulli_(1-self.p)
        return x * mask.div_(1-self.p)
```