Heuristic:Fastai Fastbook Weight Decay Tuning

From Leeroopedia



Knowledge Sources
Domains Optimization, Regularization
Last Updated 2026-02-09 17:00 GMT

Overview

Weight decay (`wd`) regularization heuristic: use `wd=0.1` for collaborative filtering and NLP models; rely on fastai's defaults elsewhere.

Description

Weight decay (L2 regularization) penalizes large parameter values by adding a term proportional to the sum of squared weights to the loss function. In practice, rather than computing the expensive sum `wd * (parameters**2).sum()`, the gradient is directly modified: `parameters.grad += wd * 2 * parameters`. The fastai library simplifies this further (absorbing the factor of 2 into the `wd` parameter), so users just pass `wd=` to `fit_one_cycle`.
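The equivalence between the two formulations can be checked directly. The following is a minimal sketch with an assumed toy setup (a single linear model with squared-error loss, not from the book), showing that differentiating the penalized loss gives the same gradient as the in-place modification `parameters.grad += wd * 2 * parameters`:

```python
import numpy as np

# Hypothetical toy setup: one linear model, squared-error loss.
rng = np.random.default_rng(0)
w = rng.normal(size=5)   # parameters
x = rng.normal(size=5)   # one input example
y = 2.0                  # its target
wd = 0.1

# View 1: add the penalty to the loss, then differentiate.
# loss_with_wd = (x @ w - y)**2 + wd * (w**2).sum()
# d/dw         = 2*(x @ w - y)*x + wd * 2 * w
grad_from_loss = 2 * (x @ w - y) * x + wd * 2 * w

# View 2: compute the plain-loss gradient, then modify it in place.
grad = 2 * (x @ w - y) * x   # gradient of the unpenalized loss
grad += wd * 2 * w           # parameters.grad += wd * 2 * parameters

assert np.allclose(grad, grad_from_loss)
```

Both views produce identical updates, which is why the library can skip materializing the `(parameters**2).sum()` term entirely.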

Usage

Use this heuristic when training models that show signs of overfitting (validation loss increasing while training loss decreases). Particularly important for:

  • Collaborative filtering: Embedding-based models tend to overfit on sparse user-item matrices
  • NLP language models: Large vocabularies and sequential dependencies encourage overfitting
  • Any model where you observe a growing gap between training and validation metrics

The Insight (Rule of Thumb)

  • Action: Pass `wd=` parameter to `fit_one_cycle` or `fit`.
  • Value: `wd=0.1` for collaborative filtering and NLP models. fastai's default weight decay works well for vision models.
  • Trade-off: Higher weight decay constrains model capacity (prevents large weights) which can reduce overfitting but may underfit if set too high.
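The capacity-constraining effect in the trade-off above can be sketched with a hand-rolled SGD loop (assumed toy regression data, not fastai code): the same training procedure with a larger `wd` converges to weights of smaller magnitude.

```python
import numpy as np

def fit(wd, steps=500, lr=0.1):
    """Plain SGD on a toy linear regression, with weight decay
    applied as the gradient modification described above."""
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    true_w = np.array([3.0, -2.0, 1.0])
    y = X @ true_w
    w = np.zeros(3)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        grad += wd * 2 * w                     # weight decay term
        w -= lr * grad
    return w

w_no_decay = fit(wd=0.0)
w_decayed = fit(wd=1.0)
# Stronger decay pulls the learned weights toward zero.
assert np.linalg.norm(w_decayed) < np.linalg.norm(w_no_decay)
```

Shrinking the weights reduces variance (less memorization of training noise) at the cost of bias; set `wd` too high and the model can no longer fit even the true signal, which is the underfitting failure mode.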

Reasoning

Weight decay is mathematically equivalent to adding `wd * (parameters**2).sum()` to the loss, but is implemented as a direct gradient modification for efficiency and numerical stability. The value `wd=0.1` was shown in the Fastbook collaborative filtering chapter to significantly improve validation loss (from overfitting to stable convergence). The key principle is that constraining weight magnitudes forces the model to learn more generalizable representations rather than memorizing the training set.
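A quick numerical sanity check of this equivalence (assumed values, not from the book) is to confirm via central finite differences that the penalty `wd * (w**2).sum()` contributes exactly `wd * 2 * w` to the gradient:

```python
import numpy as np

wd = 0.1
w = np.array([0.5, -1.2, 3.0])
eps = 1e-6

penalty = lambda w: wd * (w ** 2).sum()

# Central finite-difference gradient of the penalty term alone.
numeric = np.array([
    (penalty(w + eps * e) - penalty(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])
analytic = wd * 2 * w  # the in-place modification used in practice

assert np.allclose(numeric, analytic, atol=1e-6)
```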

Code Evidence

Weight decay explanation and usage from `08_collab.md:371-388`:

```python
# Theory:   loss_with_wd = loss + wd * (parameters**2).sum()
# Practice: parameters.grad += wd * 2 * parameters

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```

NLP language model with weight decay from `12_nlp_dive.md:836`:

```python
learn.fit_one_cycle(15, 1e-2, wd=0.1)
```
