Heuristic: Hpcaitech ColossalAI Warmup Steps Heuristic
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training_Configuration |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Use 2.5% of total training steps as warmup, cosine-anneal the learning rate down to 10% of its peak value, and pair with AdamW using betas `(0.9, 0.95)`.
Description
ColossalAI's default training recipe for large language models (e.g., Colossal-LLaMA) auto-computes warmup steps as 2.5% of total training steps when no explicit value is provided. The learning rate follows a cosine annealing schedule with a minimum value of 10% of the peak learning rate (`eta_min=0.1 * lr`). The optimizer is HybridAdam in AdamW mode with betas `(0.9, 0.95)`, which is the standard configuration for LLM pre-training and continual pre-training.
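To make the defaults concrete, here is the arithmetic for a hypothetical run; the epoch count, batch count, accumulation setting, and peak learning rate below are illustrative, not values from the repo:

```python
# Hypothetical run: 3 epochs, 4000 micro-batches per epoch, accumulation of 4.
num_epochs = 3
batches_per_epoch = 4000   # len(dataloader)
accumulation_steps = 4
lr = 2e-5                  # illustrative peak learning rate

# Optimizer steps per epoch = micro-batches // accumulation_steps.
total_steps = num_epochs * (batches_per_epoch // accumulation_steps)                # 3000
warmup_steps = int(num_epochs * 0.025 * (batches_per_epoch // accumulation_steps))  # 75
eta_min = 0.1 * lr         # cosine floor: 10% of peak
```

So a 3000-step run warms up for 75 steps and never decays below a tenth of the peak rate.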
Usage
Apply these defaults when fine-tuning or continually pre-training LLMs with ColossalAI. Override `warmup_steps` explicitly if the dataset size or training duration requires a different warmup period.
The Insight (Rule of Thumb)
- Warmup: Default to 2.5% of total training steps (i.e., `num_epochs * 0.025 * (len(dataloader) // accumulation_steps)`).
- Schedule: Cosine annealing from peak LR down to 10% of peak LR (`eta_min=0.1 * lr`).
- Optimizer: AdamW with `betas=(0.9, 0.95)` and `adamw_mode=True`.
- Weight decay: Applied via the AdamW formulation (decoupled from gradient).
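The decoupled weight decay in the last bullet can be illustrated with a single-parameter AdamW step. This is a sketch of the standard AdamW update rule, not ColossalAI's fused HybridAdam kernel; the function name and scalar interface are invented for illustration:

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a scalar parameter p (t is the 1-based step count).

    Weight decay is applied directly to p (decoupled), not folded into grad.
    """
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad          # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad * grad   # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for zero-initialized EMAs
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam step
    p = p - lr * weight_decay * p                  # decoupled decay, separate from grad
    return p, m, v
```

Because the decay term never passes through the moment estimates, the effective regularization strength does not get rescaled by the adaptive denominator, which is the point of the AdamW formulation.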
Reasoning
A short warmup (2.5%) prevents early training instability caused by large gradient updates on a randomly initialized or partially tuned model. The cosine schedule with a 10% floor avoids learning rate collapse to zero, maintaining a small but nonzero learning signal throughout training. The beta values `(0.9, 0.95)` are well-established in the LLM training literature (e.g., LLaMA, GPT-3) and balance first-moment responsiveness against second-moment stability.
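The resulting schedule can be sketched as a pure function of the step index. The linear-warmup shape is an assumption about how `CosineAnnealingWarmupLR` behaves; the repo's scheduler may differ in small details such as off-by-one indexing:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Cosine annealing with linear warmup and a floor at 10% of peak."""
    eta_min = 0.1 * peak_lr
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak down to eta_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (peak_lr - eta_min) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the cosine term is at its maximum, so the schedule hands off at exactly the peak rate; at the final step it bottoms out at `0.1 * peak_lr` rather than zero.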
Code Evidence
From `applications/Colossal-LLaMA/train.py:209-212` (optimizer configuration):
```python
optimizer = HybridAdam(
    model_params=...,
    lr=args.lr,
    betas=(0.9, 0.95),
    weight_decay=args.weight_decay,
    adamw_mode=True,
)
```
From `applications/Colossal-LLaMA/train.py:215-217` (auto-computed warmup):
```python
if args.warmup_steps is None:
    args.warmup_steps = int(args.num_epochs * 0.025 * (len(dataloader) // args.accumulation_steps))
coordinator.print_on_master(f"Warmup steps is set to {args.warmup_steps}")
```
From `applications/Colossal-LLaMA/train.py:219-224` (cosine annealing schedule):
```python
lr_scheduler = CosineAnnealingWarmupLR(
    optimizer=optimizer,
    total_steps=args.num_epochs * (len(dataloader) // args.accumulation_steps),
    warmup_steps=args.warmup_steps,
    eta_min=0.1 * args.lr,
)
```