Principle: hpcaitech ColossalAI Optimizer and Scheduler Setup
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An optimization configuration pattern that pairs ColossalAI's heterogeneous (CPU/GPU) HybridAdam optimizer with a cosine annealing learning rate schedule and linear warmup for stable large-model training.
Description
Training large language models requires careful optimizer and scheduler selection. The HybridAdam optimizer from ColossalAI extends standard AdamW with the ability to handle parameters on both CPU and GPU simultaneously, which is essential for memory-efficient training strategies like ZeRO offloading. It uses fused CUDA kernels for GPU parameters and optimized CPU kernels for offloaded parameters.
The CosineAnnealingWarmupLR scheduler provides a linear warmup phase followed by cosine annealing decay, a widely used learning rate schedule for LLM training. The warmup mitigates early training instability, when gradients are large and the Adam moment estimates are still poorly calibrated.
Usage
Use this principle when configuring the training loop for any ColossalAI training workflow. HybridAdam is preferred over standard Adam when using ZeRO or Gemini plugins that may offload parameters to CPU. The warmup ratio is typically 3-10% of total training steps.
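To make the warmup-ratio guidance concrete, a small helper can convert a ratio into the absolute step count that step-based schedulers expect. This is a hypothetical convenience function, not a ColossalAI API:

```python
def warmup_steps_from_ratio(total_steps: int, warmup_ratio: float) -> int:
    """Convert a warmup ratio (commonly 0.03-0.10 of training) into an
    absolute step count for a warmup-based scheduler.

    Hypothetical helper for illustration; ColossalAI's scheduler takes
    warmup_steps directly.
    """
    if not 0.0 < warmup_ratio < 1.0:
        raise ValueError("warmup_ratio must be in (0, 1)")
    # Round to the nearest step, but never warm up for zero steps.
    return max(1, round(total_steps * warmup_ratio))
```

For example, a 10,000-step run with a 5% warmup ratio warms up for 500 steps.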
Theoretical Basis
AdamW update rule (for gradient g_t, learning rate lr, and decoupled weight-decay coefficient lambda):

```
m_t     = beta1 * m_{t-1} + (1 - beta1) * g_t        # first-moment EMA
v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t^2      # second-moment EMA
m_hat_t = m_t / (1 - beta1^t)                        # bias correction
v_hat_t = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - lr * (m_hat_t / (sqrt(v_hat_t) + eps) + lambda * theta_{t-1})
```

Unlike Adam with L2 regularization, the weight-decay term is applied directly to the parameter rather than added to the gradient.
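As a concrete check of the AdamW update rule, here is a minimal pure-Python sketch of a single step on one scalar parameter. It is illustrative only; HybridAdam's fused CUDA/CPU kernels implement the same math vectorized over tensors:

```python
from math import sqrt

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    theta: parameter value, g: gradient, m/v: first/second moment
    estimates, t: 1-based step count. Returns (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly,
    # not folded into the gradient (the AdamW difference vs. Adam).
    theta = theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

On the first step the bias correction exactly cancels the (1 - beta) factors, so the effective step size is close to lr times the gradient's sign.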
Cosine annealing with warmup:
```
# Pseudo-code for learning rate schedule
if step < warmup_steps:
    lr = base_lr * step / warmup_steps  # linear warmup
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * progress))
```
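The pseudo-code above can be turned into a small runnable function. This is a sketch of the schedule itself under the stated formula, not ColossalAI's CosineAnnealingWarmupLR implementation, though the shape should match:

```python
from math import cos, pi

def cosine_annealing_warmup_lr(step: int, base_lr: float, total_steps: int,
                               warmup_steps: int, eta_min: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup from 0 to base_lr over
    warmup_steps, then cosine decay from base_lr toward eta_min."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * progress))
```

The schedule starts at 0, peaks at exactly base_lr when warmup ends (cos(0) = 1), and decays to eta_min at the final step (cos(pi) = -1).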