Heuristic:Fastai Fastbook Weight Decay Tuning

From Leeroopedia



Knowledge Sources
Domains Optimization, Regularization
Last Updated 2026-02-09 17:00 GMT

Overview

Weight decay (`wd`) regularization heuristic: use `wd=0.1` for collaborative filtering and NLP models; rely on fastai's defaults elsewhere.

Description

Weight decay (L2 regularization) penalizes large parameter values by adding a term proportional to the sum of squared weights to the loss function. In practice, rather than computing the expensive sum `wd * (parameters**2).sum()`, the gradient is directly modified: `parameters.grad += wd * 2 * parameters`. The fastai library simplifies this further (absorbing the factor of 2 into the `wd` parameter), so users just pass `wd=` to `fit_one_cycle`.
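The equivalence between the two formulations can be checked directly. The following is a minimal sketch with an assumed toy setup (a single linear model with squared-error loss, not from the book), showing that differentiating the penalized loss gives the same gradient as the in-place modification `parameters.grad += wd * 2 * parameters`:

```python
import numpy as np

# Hypothetical toy setup: one linear model, squared-error loss.
rng = np.random.default_rng(0)
w = rng.normal(size=5)   # parameters
x = rng.normal(size=5)   # one input example
y = 2.0                  # its target
wd = 0.1

# View 1: add the penalty to the loss, then differentiate.
# loss_with_wd = (x @ w - y)**2 + wd * (w**2).sum()
# d/dw         = 2*(x @ w - y)*x + wd * 2 * w
grad_from_loss = 2 * (x @ w - y) * x + wd * 2 * w

# View 2: compute the plain-loss gradient, then modify it in place.
grad = 2 * (x @ w - y) * x   # gradient of the unpenalized loss
grad += wd * 2 * w           # parameters.grad += wd * 2 * parameters

assert np.allclose(grad, grad_from_loss)
```

Both views produce identical updates, which is why the library can skip materializing the `(parameters**2).sum()` term entirely.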

Usage

Use this heuristic when training models that show signs of overfitting (validation loss increasing while training loss decreases). Particularly important for:

  • Collaborative filtering: Embedding-based models tend to overfit on sparse user-item matrices
  • NLP language models: Large vocabularies and sequential dependencies encourage overfitting
  • Any model where you observe a growing gap between training and validation metrics

The Insight (Rule of Thumb)

  • Action: Pass `wd=` parameter to `fit_one_cycle` or `fit`.
  • Value: `wd=0.1` for collaborative filtering and NLP models. fastai's default weight decay works well for vision models.
  • Trade-off: Higher weight decay constrains model capacity (prevents large weights) which can reduce overfitting but may underfit if set too high.
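The capacity-constraining effect in the trade-off above can be sketched with a hand-rolled SGD loop (assumed toy regression data, not fastai code): the same training procedure with a larger `wd` converges to weights of smaller magnitude.

```python
import numpy as np

def fit(wd, steps=500, lr=0.1):
    """Plain SGD on a toy linear regression, with weight decay
    applied as the gradient modification described above."""
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    true_w = np.array([3.0, -2.0, 1.0])
    y = X @ true_w
    w = np.zeros(3)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        grad += wd * 2 * w                     # weight decay term
        w -= lr * grad
    return w

w_no_decay = fit(wd=0.0)
w_decayed = fit(wd=1.0)
# Stronger decay pulls the learned weights toward zero.
assert np.linalg.norm(w_decayed) < np.linalg.norm(w_no_decay)
```

Shrinking the weights reduces variance (less memorization of training noise) at the cost of bias; set `wd` too high and the model can no longer fit even the true signal, which is the underfitting failure mode.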

Reasoning

Weight decay is mathematically equivalent to adding `wd * (parameters**2).sum()` to the loss, but is implemented as a direct gradient modification for efficiency and numerical stability. The value `wd=0.1` was shown in the Fastbook collaborative filtering chapter to significantly improve validation loss (from overfitting to stable convergence). The key principle is that constraining weight magnitudes forces the model to learn more generalizable representations rather than memorizing the training set.
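A quick numerical sanity check of this equivalence (assumed values, not from the book) is to confirm via central finite differences that the penalty `wd * (w**2).sum()` contributes exactly `wd * 2 * w` to the gradient:

```python
import numpy as np

wd = 0.1
w = np.array([0.5, -1.2, 3.0])
eps = 1e-6

penalty = lambda w: wd * (w ** 2).sum()

# Central finite-difference gradient of the penalty term alone.
numeric = np.array([
    (penalty(w + eps * e) - penalty(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])
analytic = wd * 2 * w  # the in-place modification used in practice

assert np.allclose(numeric, analytic, atol=1e-6)
```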

Code Evidence

Weight decay explanation and usage from `08_collab.md:371-388`:

```python
# Theory:   loss_with_wd = loss + wd * (parameters**2).sum()
# Practice: parameters.grad += wd * 2 * parameters

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```

NLP language model with weight decay from `12_nlp_dive.md:836`:

```python
learn.fit_one_cycle(15, 1e-2, wd=0.1)
```
