Principle:FMInference FlexLLMGen Learning Rate Scheduling
| Field | Value |
|---|---|
| Sources | Upstream: DeepSpeed, Paper: FlexGen |
| Domains | Training_Optimization, Learning_Rate_Scheduling |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A training optimization strategy that systematically varies the learning rate during training according to predefined schedules, improving convergence speed and final model quality.
Description
Learning rate scheduling dynamically adjusts the learning rate throughout training rather than using a fixed value. This is motivated by the observation that different training phases benefit from different learning rates: early training benefits from larger rates for rapid progress, while later training benefits from smaller rates for fine-grained convergence.
The principal scheduling strategies include:
- Warmup -- Gradually increasing the learning rate from a small initial value to the target value over a fixed number of steps. This prevents early training instability, which is especially important for large batch sizes and adaptive optimizers such as Adam. The warmup curve can be linear (constant rate of increase) or logarithmic (faster initial increase, slower approach to the target).
- Decay -- Reducing the learning rate after warmup, typically linearly, so the optimizer can settle into a minimum without overshooting it. The WarmupDecayLR schedule combines linear warmup with linear decay over the remaining training steps.
- OneCycle -- A two-phase schedule (ascending then descending) that follows the insight from Leslie Smith's 1Cycle policy. The learning rate first increases from a minimum to a maximum, then decreases back. This can be combined with inverse momentum cycling (high LR + low momentum, low LR + high momentum) for faster convergence.
- Range test -- A diagnostic schedule that continuously increases the learning rate to find the optimal LR range, identified by the point where loss begins to diverge.
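The warmup-then-decay pattern from the list above can be sketched as a pure function of the training step. This is a minimal illustration of a WarmupDecayLR-style schedule; the function and parameter names are illustrative, not taken from any library's API:

```python
def warmup_decay_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from min_lr to max_lr, then linear decay back toward min_lr."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to max_lr over warmup_steps.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Decay phase: ramp linearly back down over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress

# Peak learning rate is reached exactly at the end of warmup.
lrs = [warmup_decay_lr(s, max_lr=1e-3, warmup_steps=100, total_steps=1000)
       for s in range(1000)]
```

The same skeleton extends to a range test (let the warmup branch run for the whole schedule) or to OneCycle (mirror the ascending phase with a descending one).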
The key principle is that learning rate scheduling is complementary to the optimizer choice: the optimizer determines the update direction and magnitude given a learning rate, while the schedule determines how that rate evolves over time.
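This division of labour can be made concrete with a toy loop in which a plain SGD update (the optimizer) consumes whatever rate a separate schedule function produces. All names here are illustrative:

```python
def sgd_step(params, grads, lr):
    # Optimizer: determines the update direction and magnitude for a given lr.
    return [p - lr * g for p, g in zip(params, grads)]

def linear_warmup(step, max_lr, warmup_steps):
    # Schedule: determines how the rate evolves over time (hypothetical helper).
    return max_lr * min(1.0, step / warmup_steps)

# Minimise f(w) = w^2 (gradient 2w) with a warmed-up learning rate.
w = [5.0]
for step in range(1, 201):
    lr = linear_warmup(step, max_lr=0.1, warmup_steps=50)
    w = sgd_step(w, [2 * w[0]], lr)
```

Swapping either component (a different optimizer rule, a different schedule) leaves the other untouched, which is exactly the complementarity the text describes.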
Usage
Learning rate scheduling is essential for virtually all large-scale model training. The warmup phase is especially critical when training with large batch sizes (as is common in distributed training) or with mixed precision, where early training gradients may be noisy.
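In DeepSpeed, the schedule is usually selected in the engine's JSON configuration under a scheduler section. The sketch below expresses such a configuration as a Python dict; the parameter names follow DeepSpeed's documented WarmupDecayLR options, but the values are placeholders, not a recommendation:

```python
# DeepSpeed-style configuration sketch: linear warmup over the first
# 1,000 steps, then linear decay over the remaining steps.
ds_config = {
    "train_batch_size": 256,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 1e-3,
            "warmup_num_steps": 1000,
            "total_num_steps": 100000,
        },
    },
}
```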
Theoretical Basis
The theoretical motivation for learning rate schedules comes from the optimization landscape perspective. Early in training, the loss surface has steep gradients and the optimizer benefits from a large learning rate to make rapid progress. As training progresses and the model approaches a minimum, the gradients become smaller and noisier, and a smaller learning rate prevents oscillation around the minimum. The warmup phase specifically addresses the instability that arises from the interaction between large batch sizes and adaptive optimizers, where the second-moment (variance) estimates in Adam/AdamW are still unreliable during the first steps of training.
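The oscillation argument can be made concrete on a one-dimensional quadratic, where the stability condition is visible analytically. This toy illustration is not drawn from the source:

```python
def gd(lr, steps=100, w0=1.0):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    # Each step scales w by (1 - 2*lr), so the iterate converges
    # when |1 - 2*lr| < 1 and oscillates divergently otherwise.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = gd(lr=0.1)  # |1 - 0.2| = 0.8 < 1: shrinks toward the minimum
large = gd(lr=1.1)  # |1 - 2.2| = 1.2 > 1: overshoots and diverges
```

Real loss surfaces have curvature that varies over training, which is why a single fixed rate that is stable late in training may be needlessly slow early on, and vice versa.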