Principle: hpcaitech ColossalAI Optimizer and Scheduler Setup
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An optimization configuration pattern that pairs ColossalAI's heterogeneous (CPU/GPU) HybridAdam optimizer with a cosine annealing learning rate schedule and linear warmup for stable large-model training.
Description
Training large language models requires careful optimizer and scheduler selection. The HybridAdam optimizer from ColossalAI extends standard AdamW with the ability to handle parameters on both CPU and GPU simultaneously, which is essential for memory-efficient training strategies like ZeRO offloading. It uses fused CUDA kernels for GPU parameters and optimized CPU kernels for offloaded parameters.
The CosineAnnealingWarmupLR scheduler provides a linear warmup phase followed by cosine annealing decay, a widely used learning rate schedule for LLM training. The warmup mitigates early training instability, when gradients are large and the Adam moment estimates are still poorly calibrated.
Usage
Use this principle when configuring the training loop for any ColossalAI training workflow. HybridAdam is preferred over standard Adam when using ZeRO or Gemini plugins that may offload parameters to CPU. The warmup ratio is typically 3-10% of total training steps.
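To make the warmup-ratio guidance concrete, a small helper can convert a ratio into the absolute step count that step-based schedulers expect. This is a hypothetical convenience function, not a ColossalAI API:

```python
def warmup_steps_from_ratio(total_steps: int, warmup_ratio: float) -> int:
    """Convert a warmup ratio (commonly 0.03-0.10 of training) into an
    absolute step count for a warmup-based scheduler.

    Hypothetical helper for illustration; ColossalAI's scheduler takes
    warmup_steps directly.
    """
    if not 0.0 < warmup_ratio < 1.0:
        raise ValueError("warmup_ratio must be in (0, 1)")
    # Round to the nearest step, but never warm up for zero steps.
    return max(1, round(total_steps * warmup_ratio))
```

For example, a 10,000-step run with a 5% warmup ratio warms up for 500 steps.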
Theoretical Basis
AdamW update rule (for gradient g_t, learning rate lr, and decoupled weight-decay coefficient lambda):

```
m_t     = beta1 * m_{t-1} + (1 - beta1) * g_t        # first-moment EMA
v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t^2      # second-moment EMA
m_hat_t = m_t / (1 - beta1^t)                        # bias correction
v_hat_t = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - lr * (m_hat_t / (sqrt(v_hat_t) + eps) + lambda * theta_{t-1})
```

Unlike Adam with L2 regularization, the weight-decay term is applied directly to the parameter rather than added to the gradient.
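As a concrete check of the AdamW update rule, here is a minimal pure-Python sketch of a single step on one scalar parameter. It is illustrative only; HybridAdam's fused CUDA/CPU kernels implement the same math vectorized over tensors:

```python
from math import sqrt

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    theta: parameter value, g: gradient, m/v: first/second moment
    estimates, t: 1-based step count. Returns (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly,
    # not folded into the gradient (the AdamW difference vs. Adam).
    theta = theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

On the first step the bias correction exactly cancels the (1 - beta) factors, so the effective step size is close to lr times the gradient's sign.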
Cosine annealing with warmup:
```
# Pseudo-code for learning rate schedule
if step < warmup_steps:
    lr = base_lr * step / warmup_steps  # linear warmup
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * progress))
```
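The pseudo-code above can be turned into a small runnable function. This is a sketch of the schedule itself under the stated formula, not ColossalAI's CosineAnnealingWarmupLR implementation, though the shape should match:

```python
from math import cos, pi

def cosine_annealing_warmup_lr(step: int, base_lr: float, total_steps: int,
                               warmup_steps: int, eta_min: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup from 0 to base_lr over
    warmup_steps, then cosine decay from base_lr toward eta_min."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * progress))
```

The schedule starts at 0, peaks at exactly base_lr when warmup ends (cos(0) = 1), and decays to eta_min at the final step (cos(pi) = -1).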