Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Hpcaitech ColossalAI Distributed Checkpoint Saving

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Infrastructure
Last Updated 2026-02-09 00:00 GMT

Overview

A distributed checkpointing pattern that persists sharded model weights, optimizer states, scheduler states, and training metadata across multiple ranks for fault-tolerant resumable training.

Description

Distributed Checkpoint Saving solves the problem of persisting training state in distributed training environments where model parameters, optimizer states, and gradients may be partitioned across multiple GPUs. The checkpointing system coordinates across all ranks to save a consistent snapshot that can be used for:

  • Training Resumption: Resume from the exact step after a failure
  • Model Export: Extract the final model for inference
  • Evaluation Checkpoints: Save periodic snapshots for evaluation

The checkpoint structure includes sharded model weights, sharded optimizer states, LR scheduler state, and a metadata JSON file tracking the current epoch, step, and sample index.

Usage

Use this principle during and after training to save progress. The save frequency is a trade-off between checkpoint overhead and data loss risk.

Theoretical Basis

Checkpoint structure:

# Abstract checkpoint layout
checkpoint_dir/
    epoch-{N}_step-{M}/
        modeling/          # Sharded model weights
        optimizer/         # Sharded optimizer states
        lr_scheduler       # LR scheduler state_dict
        running_states.json  # {"epoch": N, "step": M, "sample_start_index": ...}

Key considerations:

  1. Consistency: All ranks must save at the same logical step
  2. Sharding: Large models are saved in shards to avoid OOM
  3. Atomicity: Partial saves should not corrupt previous checkpoints

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment