Principle: Fastai Fastbook Learning Rate Selection
| Knowledge Sources | Details |
|---|---|
| Domains | Deep_Learning, Optimization, Computer_Vision |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Learning rate selection is the process of empirically determining the optimal step size for gradient descent before committing to a full training run.
Description
The learning rate is the single most important hyperparameter in neural network training. It controls how much the model weights are adjusted in response to the computed gradient at each optimization step:
weight_new = weight_old - learning_rate * gradient
If the learning rate is too high, the optimizer overshoots the loss minimum and training diverges (loss explodes). If too low, training converges extremely slowly or gets stuck in a poor local minimum. Finding the right value is critical and was historically done by trial and error.
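The effect of the step size can be seen in a toy one-dimensional example (a minimal sketch: the quadratic loss and the specific rates are illustrative, not from the source):

```python
# Toy example: minimize loss(w) = w^2, whose gradient is 2*w,
# using the update rule  w_new = w_old - learning_rate * gradient.

def gradient_descent(lr, steps=20, w=1.0):
    """Run `steps` updates on loss(w) = w**2 and return the final weight."""
    for _ in range(steps):
        grad = 2 * w          # d/dw of w^2
        w = w - lr * grad     # the weight update rule
    return w

# A well-chosen rate moves w toward the minimum at w = 0,
# while an overly large rate makes |w| grow every step (divergence).
converged = gradient_descent(lr=0.1)   # |w| shrinks by 0.8 per step
diverged = gradient_descent(lr=1.5)    # |w| doubles per step
```

Here `lr=0.1` multiplies the weight by 0.8 each step, while `lr=1.5` multiplies it by -2, so the loss explodes exactly as described above.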
The learning rate finder technique, proposed by Leslie Smith (2015), automates this search. It runs a short mock training session where the learning rate increases exponentially from a very small value to a very large value over a fixed number of iterations. The loss is recorded at each step. The resulting loss-vs-learning-rate plot reveals the optimal range.
Usage
Run the learning rate finder once after creating a Learner and before calling any training method. Re-running it is especially important after:
- Creating a new Learner for the first time
- Unfreezing the model body (the optimal rate changes when more parameters are trainable)
- Changing the dataset significantly (different data distribution may shift the optimal rate)
Theoretical Basis
The LR Finder Algorithm
The learning rate finder follows this procedure:
1. Set the learning rate to a very small value (e.g., 1e-7).
2. Train for one mini-batch and record the loss.
3. Multiply the learning rate by a constant factor (e.g., 1.3).
4. Repeat steps 2-3 for a fixed number of iterations (e.g., 100).
5. Plot loss vs. log(learning_rate).
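The procedure above can be sketched in plain Python (a minimal sketch: a quadratic loss stands in for a real network and mini-batches, and the bounds and factor are the illustrative values from the steps):

```python
def lr_finder(start_lr=1e-7, factor=1.3, iters=100, w=1.0):
    """Mock training run: one 'mini-batch' step per iteration on
    loss(w) = w**2, multiplying the learning rate by `factor` each time."""
    lr = start_lr
    lrs, losses = [], []
    for _ in range(iters):
        lrs.append(lr)
        losses.append(w ** 2)  # record loss before the step
        w = w - lr * 2 * w     # one gradient step (gradient of w^2 is 2w)
        lr *= factor           # exponential LR schedule
    return lrs, losses

lrs, losses = lr_finder()
# In a real run you would plot losses against log10 of each lr;
# here we just locate the lr at the lowest recorded loss.
best_lr = lrs[min(range(len(losses)), key=losses.__getitem__)]
```

Even on this toy problem the curve shows the characteristic shape: flat at tiny rates, falling through the sweet spot, then exploding once the rate is too large.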
Interpreting the Plot
The loss-vs-learning-rate curve has a characteristic shape:
| Region | Learning Rate Range | Loss Behavior | Interpretation |
|---|---|---|---|
| Too low | < 1e-4 (typical) | Flat or very slowly decreasing | Learning is too slow; gradients barely move the weights |
| Sweet spot | ~1e-3 to ~1e-2 (typical) | Steeply decreasing | Optimal range; fast convergence without instability |
| Too high | > 1e-1 (typical) | Sharply increasing or diverging | Optimizer overshoots; weights oscillate wildly |
Selection Heuristics
Two common heuristics for selecting the learning rate from the plot:
- One order of magnitude before the minimum: Find the learning rate where the loss is lowest, then divide by 10. This provides a safety margin below the instability threshold.
- Steepest descent: Find the learning rate at the point of steepest negative slope. This maximizes the rate of loss decrease.
The fastai lr_find method can return both values, conventionally named lr_min (the learning rate at the lowest loss, divided by 10) and lr_steep (the learning rate at the point of steepest negative gradient); the exact names and default suggestions vary across fastai versions.
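Both heuristics can be applied directly to a recorded loss-vs-learning-rate curve. This is an illustrative pure-Python re-implementation, not fastai's actual code, and the synthetic curve below is invented for the demonstration:

```python
import math

def suggest_lrs(lrs, losses):
    """Apply the two selection heuristics to a recorded curve."""
    # Heuristic 1: lr at the lowest loss, divided by 10 for a safety margin.
    i_min = min(range(len(losses)), key=losses.__getitem__)
    lr_min = lrs[i_min] / 10
    # Heuristic 2: lr where the loss falls fastest per unit of log(lr).
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log(lrs[i + 1]) - math.log(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    lr_steep = lrs[min(range(len(slopes)), key=slopes.__getitem__)]
    return lr_min, lr_steep

# Synthetic curve with the characteristic shape:
# flat region, then steeply falling, then exploding.
lrs = [1e-5 * 10 ** (i / 10) for i in range(50)]          # 1e-5 .. ~1e-0.1
losses = ([2.0] * 20
          + [2.0 - 0.12 * i for i in range(1, 16)]        # steep descent
          + [0.1 * 2 ** i for i in range(15)])            # divergence
lr_min, lr_steep = suggest_lrs(lrs, losses)
```

As expected, both suggestions land below the learning rate where the loss starts to diverge, with lr_steep sitting inside the steeply falling region.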
Mathematical Basis
The exponential schedule used by the finder can be expressed as:
lr_i = start_lr * (end_lr / start_lr) ^ (i / num_iterations)
where i is the current iteration. This ensures uniform spacing on a logarithmic scale, which is appropriate because the optimal learning rate often varies over several orders of magnitude.
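As a quick check, the schedule can be implemented directly (a minimal sketch; the start/end values and iteration count are illustrative):

```python
import math

def lr_schedule(i, start_lr=1e-7, end_lr=10.0, num_iterations=100):
    """lr_i = start_lr * (end_lr / start_lr) ** (i / num_iterations)"""
    return start_lr * (end_lr / start_lr) ** (i / num_iterations)

lrs = [lr_schedule(i) for i in range(101)]
# Consecutive rates differ by a constant ratio, so the schedule is
# evenly spaced on a log scale, spanning eight orders of magnitude here.
ratios = [lrs[i + 1] / lrs[i] for i in range(100)]
```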