Principle: hpcaitech ColossalAI Distributed Checkpoint Saving
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A distributed checkpointing pattern that persists sharded model weights, optimizer states, scheduler states, and training metadata across multiple ranks for fault-tolerant resumable training.
Description
Distributed Checkpoint Saving solves the problem of persisting training state in distributed training environments where model parameters, optimizer states, and gradients may be partitioned across multiple GPUs. The checkpointing system coordinates across all ranks to save a consistent snapshot that can be used for:
- Training Resumption: Resume from the exact step after a failure
- Model Export: Extract the final model for inference
- Evaluation Checkpoints: Save periodic snapshots for evaluation
The checkpoint structure includes sharded model weights, sharded optimizer states, LR scheduler state, and a metadata JSON file tracking the current epoch, step, and sample index.
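The metadata file is the simplest piece to make concrete. The sketch below is a minimal illustration of writing `running_states.json` with the fields named above; the helper name and directory naming are assumptions for illustration, not ColossalAI's actual implementation.

```python
import json
import os

def save_running_states(checkpoint_dir: str, epoch: int, step: int,
                        sample_start_index: int) -> str:
    """Write the training-progress metadata next to the sharded states.

    Hypothetical helper: the epoch-{N}_step-{M} directory name mirrors the
    checkpoint layout described in this document.
    """
    save_dir = os.path.join(checkpoint_dir, f"epoch-{epoch}_step-{step}")
    os.makedirs(save_dir, exist_ok=True)
    states = {
        "epoch": epoch,
        "step": step,
        "sample_start_index": sample_start_index,
    }
    path = os.path.join(save_dir, "running_states.json")
    with open(path, "w") as f:
        json.dump(states, f)
    return path
```

On resume, reading this file back tells the data loader where to restart sampling so no batch is skipped or repeated.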
Usage
Use this principle during and after training to persist progress. Save frequency is a trade-off: more frequent checkpoints bound the amount of work lost after a failure, but add I/O overhead and storage cost.
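A common way to realize this trade-off is a fixed step interval plus a forced save at the end of training. A minimal sketch, with the function name and parameters chosen here for illustration:

```python
def should_save(step: int, save_interval: int, is_last_step: bool) -> bool:
    """Decide whether to checkpoint at this step.

    Saves every `save_interval` steps, and always at the final step so the
    finished model is never lost to an unlucky interval boundary.
    """
    return is_last_step or (step > 0 and step % save_interval == 0)
```

All ranks must evaluate this predicate identically (same `step`, same interval) so they agree on when to enter the collective save, which is the consistency requirement discussed below.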
Theoretical Basis
Checkpoint structure:
# Abstract checkpoint layout
checkpoint_dir/
  epoch-{N}_step-{M}/
    modeling/            # Sharded model weights
    optimizer/           # Sharded optimizer states
    lr_scheduler         # LR scheduler state_dict
    running_states.json  # {"epoch": N, "step": M, "sample_start_index": ...}
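The layout above is produced cooperatively: every rank writes its own shard, and one designated rank (conventionally rank 0) writes the shared metadata. The sketch below is a simplified stand-in using JSON shard files named `shard-{rank}` purely for illustration; a real ColossalAI checkpoint stores binary tensor shards through its checkpoint I/O rather than JSON.

```python
import json
import os

def save_checkpoint_shard(checkpoint_dir: str, epoch: int, step: int,
                          rank: int, model_shard: dict, optim_shard: dict) -> str:
    """Each rank persists its own model/optimizer shard; rank 0 also writes
    the shared metadata file. Directory names follow the layout above."""
    save_dir = os.path.join(checkpoint_dir, f"epoch-{epoch}_step-{step}")
    os.makedirs(os.path.join(save_dir, "modeling"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "optimizer"), exist_ok=True)
    # Per-rank shard files avoid write contention between ranks.
    with open(os.path.join(save_dir, "modeling", f"shard-{rank}.json"), "w") as f:
        json.dump(model_shard, f)
    with open(os.path.join(save_dir, "optimizer", f"shard-{rank}.json"), "w") as f:
        json.dump(optim_shard, f)
    if rank == 0:
        # Only one rank writes shared, rank-independent state.
        with open(os.path.join(save_dir, "running_states.json"), "w") as f:
            json.dump({"epoch": epoch, "step": step}, f)
    return save_dir
```

In a real run, a barrier (e.g. a collective synchronization) after all writes ensures no rank resumes training before the checkpoint is complete on every rank.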
Key considerations:
- Consistency: All ranks must save at the same logical step
- Sharding: Large models are saved in shards to avoid OOM
- Atomicity: Partial saves should not corrupt previous checkpoints
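The atomicity consideration is commonly implemented with a write-then-rename pattern: write to a temporary file in the same directory, then rename it into place, so readers never observe a partial file. A minimal sketch for the metadata file (the same idea extends to whole checkpoint directories):

```python
import json
import os
import tempfile

def atomic_save_json(final_path: str, payload: dict) -> None:
    """Write JSON to a temp file, fsync, then rename into place.

    os.replace is atomic on POSIX when source and destination are on the
    same filesystem, so a crash mid-save leaves the previous file intact.
    """
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, final_path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```

Placing the temp file in the target directory (not the system temp dir) keeps the rename on one filesystem, which is what makes the replace atomic.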