
Principle:Hpcaitech ColossalAI Optimizer Scheduler Setup

From Leeroopedia


Knowledge Sources
Domains Optimization, Deep_Learning
Last Updated 2026-02-09 00:00 GMT

Overview

An optimization configuration pattern combining a heterogeneous Adam optimizer with cosine annealing learning rate scheduling and linear warmup for stable large-model training.

Description

Training large language models requires careful optimizer and scheduler selection. The HybridAdam optimizer from ColossalAI extends standard AdamW with the ability to handle parameters on both CPU and GPU simultaneously, which is essential for memory-efficient training strategies like ZeRO offloading. It uses fused CUDA kernels for GPU parameters and optimized CPU kernels for offloaded parameters.

The CosineAnnealingWarmupLR scheduler provides a linear warmup phase followed by cosine annealing decay, which is the standard learning rate schedule for LLM training. The warmup prevents early training instability when gradients are large.
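A minimal setup sketch combining the two components. This assumes ColossalAI's `HybridAdam` and `CosineAnnealingWarmupLR` are importable from `colossalai.nn.optimizer` and `colossalai.nn.lr_scheduler`; the model, learning rate, and step counts are placeholder values, and argument names may vary across ColossalAI versions:

```python
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR

model = nn.Linear(1024, 1024)  # stand-in for the real model

# HybridAdam handles both GPU-resident and CPU-offloaded parameters,
# so it pairs naturally with ZeRO/Gemini offloading.
optimizer = HybridAdam(model.parameters(), lr=3e-4, betas=(0.9, 0.95),
                       weight_decay=0.1)

total_steps = 10_000
warmup_steps = int(0.05 * total_steps)  # ~5% warmup, within the typical range
scheduler = CosineAnnealingWarmupLR(optimizer, total_steps=total_steps,
                                    warmup_steps=warmup_steps)
```

In the training loop, `scheduler.step()` is called once per optimizer step so the learning rate follows the warmup-then-cosine curve.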

Usage

Use this principle when configuring the training loop for any ColossalAI training workflow. HybridAdam is preferred over standard Adam when using ZeRO or Gemini plugins that may offload parameters to CPU. The warmup ratio is typically 3-10% of total training steps.
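As a concrete illustration of the 3-10% guideline (the step count here is hypothetical):

```python
total_steps = 10_000           # hypothetical total training steps
warmup_ratio = 0.05            # 5%, within the typical 3-10% range
warmup_steps = int(total_steps * warmup_ratio)
print(warmup_steps)            # 500
```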

Theoretical Basis

AdamW update rule, with bias-corrected moments $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$
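The update rule can be checked with a single scalar step in plain Python (a didactic sketch, not the fused kernel HybridAdam actually uses; hyperparameter defaults here are the common AdamW ones):

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter with decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v

# First step from theta=0 with gradient 1: bias correction makes
# m_hat = v_hat = 1, so the update is approximately -lr.
theta, m, v = adamw_step(0.0, 1.0, 0.0, 0.0, t=1)
```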

Cosine annealing with warmup:

# Pseudo-code for learning rate schedule
if step < warmup_steps:
    lr = base_lr * step / warmup_steps  # Linear warmup
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * progress))
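The pseudo-code above can be made runnable as a self-contained function (names and defaults are illustrative, not ColossalAI's API):

```python
import math

def cosine_warmup_lr(step, total_steps, warmup_steps, base_lr, eta_min=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))
```

The schedule starts at 0, peaks at `base_lr` when `step == warmup_steps`, and decays to `eta_min` at `total_steps`.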

Related Pages

Implemented By

Heuristic Links
