Heuristic: Hpcaitech ColossalAI Warmup Steps Heuristic
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training_Configuration |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Use 2.5% of total training steps as warmup, cosine-anneal the learning rate down to 10% of its peak value, and pair with AdamW using betas `(0.9, 0.95)`.
Description
ColossalAI's default training recipe for large language models (e.g., Colossal-LLaMA) auto-computes warmup steps as 2.5% of total training steps when no explicit value is provided. The learning rate follows a cosine annealing schedule with a minimum value of 10% of the peak learning rate (`eta_min=0.1 * lr`). The optimizer is HybridAdam in AdamW mode with betas `(0.9, 0.95)`, which is the standard configuration for LLM pre-training and continual pre-training.
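To make the defaults concrete, here is the arithmetic for a hypothetical run; the epoch count, batch count, accumulation setting, and peak learning rate below are illustrative, not values from the repo:

```python
# Hypothetical run: 3 epochs, 4000 micro-batches per epoch, accumulation of 4.
num_epochs = 3
batches_per_epoch = 4000   # len(dataloader)
accumulation_steps = 4
lr = 2e-5                  # illustrative peak learning rate

# Optimizer steps per epoch = micro-batches // accumulation_steps.
total_steps = num_epochs * (batches_per_epoch // accumulation_steps)                # 3000
warmup_steps = int(num_epochs * 0.025 * (batches_per_epoch // accumulation_steps))  # 75
eta_min = 0.1 * lr         # cosine floor: 10% of peak
```

So a 3000-step run warms up for 75 steps and never decays below a tenth of the peak rate.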
Usage
Apply these defaults when fine-tuning or continually pre-training LLMs with ColossalAI. Override `warmup_steps` explicitly if the dataset size or training duration requires a different warmup period.
The Insight (Rule of Thumb)
- Warmup: Default to 2.5% of total training steps (i.e., `num_epochs * 0.025 * (len(dataloader) // accumulation_steps)`).
- Schedule: Cosine annealing from peak LR down to 10% of peak LR (`eta_min=0.1 * lr`).
- Optimizer: AdamW with `betas=(0.9, 0.95)` and `adamw_mode=True`.
- Weight decay: Applied via the AdamW formulation (decoupled from gradient).
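The decoupled weight decay in the last bullet can be illustrated with a single-parameter AdamW step. This is a sketch of the standard AdamW update rule, not ColossalAI's fused HybridAdam kernel; the function name and scalar interface are invented for illustration:

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a scalar parameter p (t is the 1-based step count).

    Weight decay is applied directly to p (decoupled), not folded into grad.
    """
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad          # first moment: EMA of gradients
    v = b2 * v + (1 - b2) * grad * grad   # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for zero-initialized EMAs
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam step
    p = p - lr * weight_decay * p                  # decoupled decay, separate from grad
    return p, m, v
```

Because the decay term never passes through the moment estimates, the effective regularization strength does not get rescaled by the adaptive denominator, which is the point of the AdamW formulation.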
Reasoning
A short warmup (2.5%) prevents early training instability caused by large gradient updates on a randomly initialized or partially tuned model. The cosine schedule with a 10% floor avoids learning rate collapse to zero, maintaining a small but nonzero learning signal throughout training. The beta values `(0.9, 0.95)` are well-established in the LLM training literature (e.g., LLaMA, GPT-3) and balance first-moment responsiveness against second-moment stability.
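The resulting schedule can be sketched as a pure function of the step index. The linear-warmup shape is an assumption about how `CosineAnnealingWarmupLR` behaves; the repo's scheduler may differ in small details such as off-by-one indexing:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Cosine annealing with linear warmup and a floor at 10% of peak."""
    eta_min = 0.1 * peak_lr
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak down to eta_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (peak_lr - eta_min) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the cosine term is at its maximum, so the schedule hands off at exactly the peak rate; at the final step it bottoms out at `0.1 * peak_lr` rather than zero.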
Code Evidence
From `applications/Colossal-LLaMA/train.py:209-212` (optimizer configuration):
```python
optimizer = HybridAdam(
    model_params=...,
    lr=args.lr,
    betas=(0.9, 0.95),
    weight_decay=args.weight_decay,
    adamw_mode=True,
)
```
From `applications/Colossal-LLaMA/train.py:215-217` (auto-computed warmup):
```python
if args.warmup_steps is None:
    args.warmup_steps = int(args.num_epochs * 0.025 * (len(dataloader) // args.accumulation_steps))
coordinator.print_on_master(f"Warmup steps is set to {args.warmup_steps}")
```
From `applications/Colossal-LLaMA/train.py:219-224` (cosine annealing schedule):
```python
lr_scheduler = CosineAnnealingWarmupLR(
    optimizer=optimizer,
    total_steps=args.num_epochs * (len(dataloader) // args.accumulation_steps),
    warmup_steps=args.warmup_steps,
    eta_min=0.1 * args.lr,
)
```