Principle:FMInference FlexLLMGen Learning Rate Scheduling
| Field | Value |
|---|---|
| Sources | Upstream: DeepSpeed, Paper: FlexGen |
| Domains | Training_Optimization, Learning_Rate_Scheduling |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A training optimization strategy that systematically varies the learning rate during training according to predefined schedules, improving convergence speed and final model quality.
Description
Learning rate scheduling dynamically adjusts the learning rate throughout training rather than using a fixed value. This is motivated by the observation that different training phases benefit from different learning rates: early training benefits from larger rates for rapid progress, while later training benefits from smaller rates for fine-grained convergence.
The principal scheduling strategies include:
- Warmup -- Gradually increasing the learning rate from a small initial value to the target value over a fixed number of steps. This prevents early training instability, which is especially important for large batch sizes and adaptive optimizers such as Adam. The warmup curve can be linear (constant rate of increase) or logarithmic (faster initial increase, slower approach to the target).
- Decay -- Reducing the learning rate after warmup, typically linearly, so the optimizer can settle into a minimum without overshooting it. The WarmupDecayLR schedule combines linear warmup with linear decay over the remaining training steps.
- OneCycle -- A two-phase schedule (ascending then descending) that follows the insight from Leslie Smith's 1Cycle policy. The learning rate first increases from a minimum to a maximum, then decreases back. This can be combined with inverse momentum cycling (high LR + low momentum, low LR + high momentum) for faster convergence.
- Range test -- A diagnostic schedule that continuously increases the learning rate to find the optimal LR range, identified by the point where loss begins to diverge.
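The warmup-then-decay pattern from the list above can be sketched as a pure function of the training step. This is a minimal illustration of a WarmupDecayLR-style schedule; the function and parameter names are illustrative, not taken from any library's API:

```python
def warmup_decay_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from min_lr to max_lr, then linear decay back toward min_lr."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to max_lr over warmup_steps.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Decay phase: ramp linearly back down over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress

# Peak learning rate is reached exactly at the end of warmup.
lrs = [warmup_decay_lr(s, max_lr=1e-3, warmup_steps=100, total_steps=1000)
       for s in range(1000)]
```

The same skeleton extends to a range test (let the warmup branch run for the whole schedule) or to OneCycle (mirror the ascending phase with a descending one).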
The key principle is that learning rate scheduling is complementary to the optimizer choice: the optimizer determines the update direction and magnitude given a learning rate, while the schedule determines how that rate evolves over time.
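This division of labour can be made concrete with a toy loop in which a plain SGD update (the optimizer) consumes whatever rate a separate schedule function produces. All names here are illustrative:

```python
def sgd_step(params, grads, lr):
    # Optimizer: determines the update direction and magnitude for a given lr.
    return [p - lr * g for p, g in zip(params, grads)]

def linear_warmup(step, max_lr, warmup_steps):
    # Schedule: determines how the rate evolves over time (hypothetical helper).
    return max_lr * min(1.0, step / warmup_steps)

# Minimise f(w) = w^2 (gradient 2w) with a warmed-up learning rate.
w = [5.0]
for step in range(1, 201):
    lr = linear_warmup(step, max_lr=0.1, warmup_steps=50)
    w = sgd_step(w, [2 * w[0]], lr)
```

Swapping either component (a different optimizer rule, a different schedule) leaves the other untouched, which is exactly the complementarity the text describes.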
Usage
Learning rate scheduling is essential for virtually all large-scale model training. The warmup phase is especially critical when training with large batch sizes (as is common in distributed training) or with mixed precision, where early training gradients may be noisy.
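In DeepSpeed, the schedule is usually selected in the engine's JSON configuration under a scheduler section. The sketch below expresses such a configuration as a Python dict; the parameter names follow DeepSpeed's documented WarmupDecayLR options, but the values are placeholders, not a recommendation:

```python
# DeepSpeed-style configuration sketch: linear warmup over the first
# 1,000 steps, then linear decay over the remaining steps.
ds_config = {
    "train_batch_size": 256,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 1e-3,
            "warmup_num_steps": 1000,
            "total_num_steps": 100000,
        },
    },
}
```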
Theoretical Basis
The theoretical motivation for learning rate schedules comes from the optimization landscape perspective. Early in training, the loss surface has steep gradients and the optimizer benefits from a large learning rate to make rapid progress. As training progresses and the model approaches a minimum, the gradients become smaller and noisier, and a smaller learning rate prevents oscillation around the minimum. The warmup phase specifically addresses the instability that arises from the interaction between large batch sizes and adaptive optimizers, where the second-moment (variance) estimates in Adam/AdamW are still unreliable during the first steps of training.
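The oscillation argument can be made concrete on a one-dimensional quadratic, where the stability condition is visible analytically. This toy illustration is not drawn from the source:

```python
def gd(lr, steps=100, w0=1.0):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    # Each step scales w by (1 - 2*lr), so the iterate converges
    # when |1 - 2*lr| < 1 and oscillates divergently otherwise.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = gd(lr=0.1)  # |1 - 0.2| = 0.8 < 1: shrinks toward the minimum
large = gd(lr=1.1)  # |1 - 2.2| = 1.2 > 1: overshoots and diverges
```

Real loss surfaces have curvature that varies over training, which is why a single fixed rate that is stable late in training may be needlessly slow early on, and vice versa.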