Heuristic: Fastai Fastbook Discriminative Learning Rates
| Knowledge Sources | |
|---|---|
| Domains | Transfer_Learning, Optimization |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Use Python `slice(low_lr, high_lr)` to apply lower learning rates to early pretrained layers and higher rates to later layers.
Description
Discriminative learning rates assign different learning rates to different layer groups of a neural network during transfer learning. Early layers (which learn basic features like edges and textures) should be updated slowly to preserve pretrained knowledge, while later layers (which learn task-specific features) need higher learning rates to adapt quickly. This technique was developed as part of the ULMFiT approach and is a default strategy in fastai.
The fastai library accepts a Python `slice` object wherever a learning rate is expected. The first value becomes the LR for the earliest layers; the second value becomes the LR for the final layers. Intermediate layers receive multiplicatively spaced learning rates between these bounds.
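The multiplicative spacing can be illustrated with a small standalone sketch (plain Python, no fastai required; the function name `spaced_lrs` is ours for illustration, not a fastai API):

```python
import math

def spaced_lrs(low, high, n_groups):
    """Return n_groups learning rates spaced multiplicatively
    (evenly in log space) from low (earliest layers) to high (head)."""
    if n_groups == 1:
        return [high]
    ratio = (high / low) ** (1 / (n_groups - 1))
    return [low * ratio**i for i in range(n_groups)]

# For slice(1e-6, 1e-4) over 3 layer groups the middle group
# lands at the geometric midpoint, 1e-5:
lrs = spaced_lrs(1e-6, 1e-4, 3)
```

With this spacing the earliest group trains 100x slower than the head, while each adjacent pair of groups differs by the same constant factor.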
Usage
Use this heuristic after unfreezing a pretrained model for fine-tuning. The standard workflow is:
- Train with frozen pretrained layers (only random head trains)
- Unfreeze all layers with `learn.unfreeze()`
- Use discriminative LRs: `learn.fit_one_cycle(epochs, lr_max=slice(low_lr, high_lr))`
For NLP, use gradual unfreezing combined with discriminative LRs: unfreeze one layer group at a time with `freeze_to(-2)`, `freeze_to(-3)`, then full `unfreeze()`.
The Insight (Rule of Thumb)
- Action: Pass `slice(low_lr, high_lr)` to `fit_one_cycle` after unfreezing.
- Value:
- Vision: `slice(1e-6, 1e-4)` — two orders of magnitude range
- NLP: `slice(lr/(2.6**4), lr)` — each layer group gets LR divided by 2.6
- Trade-off: Using uniform LR after unfreezing risks destroying pretrained features in early layers. Discriminative LRs preserve these features while allowing later layers to adapt.
Reasoning
Yosinski et al. (2014) showed that early layers of deep networks learn general-purpose features (edges, textures) that transfer well across tasks, while later layers learn increasingly task-specific features. This motivates training different layers at different speeds: a uniformly high learning rate would destroy the valuable pretrained features in early layers, while a uniformly low learning rate would slow the adaptation of later layers.
The NLP scaling factor of 2.6 between layer groups was empirically determined in the ULMFiT paper and has proven robust across different text classification tasks.
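The connection between the 2.6 factor and the slice formula can be checked numerically. Assuming five layer groups (the group count here is an illustrative assumption, not taken from the source), `slice(lr/(2.6**4), lr)` with log-uniform spacing gives each adjacent pair of groups a ratio of exactly 2.6:

```python
import math

lr = 1e-2
n_groups = 5  # assumed group count for illustration
low, high = lr / 2.6**4, lr

# log-uniform spacing between the slice endpoints,
# mirroring how intermediate groups fall between the bounds
ratio = (high / low) ** (1 / (n_groups - 1))
lrs = [low * ratio**i for i in range(n_groups)]
```

Since `high / low == 2.6**4` and there are four gaps between five groups, each group's learning rate is 2.6x that of the group below it, which is exactly the per-group divisor the ULMFiT paper reports.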
Code Evidence
Vision discriminative LRs from `05_pet_breeds.md:757-761`:
```python
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)
learn.unfreeze()
learn.fit_one_cycle(12, lr_max=slice(1e-6,1e-4))
```
NLP gradual unfreezing with discriminative LRs from `10_nlp.md:725-761`:
```python
# Layer-by-layer unfreezing with discriminative LRs
learn.fit_one_cycle(1, 2e-2)                        # frozen
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))   # last 2 groups
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))   # last 3 groups
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))   # all layers
```