Principle: hpcaitech ColossalAI Distributed Checkpoint Saving
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A distributed checkpointing pattern that persists sharded model weights, optimizer states, scheduler states, and training metadata across multiple ranks for fault-tolerant resumable training.
Description
Distributed Checkpoint Saving solves the problem of persisting training state in distributed training environments where model parameters, optimizer states, and gradients may be partitioned across multiple GPUs. The checkpointing system coordinates across all ranks to save a consistent snapshot that can be used for:
- Training Resumption: Resume from the exact step after a failure
- Model Export: Extract the final model for inference
- Evaluation Checkpoints: Save periodic snapshots for evaluation
The checkpoint structure includes sharded model weights, sharded optimizer states, LR scheduler state, and a metadata JSON file tracking the current epoch, step, and sample index.
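The metadata file is the simplest piece to make concrete. The sketch below is a minimal illustration of writing `running_states.json` with the fields named above; the helper name and directory naming are assumptions for illustration, not ColossalAI's actual implementation.

```python
import json
import os

def save_running_states(checkpoint_dir: str, epoch: int, step: int,
                        sample_start_index: int) -> str:
    """Write the training-progress metadata next to the sharded states.

    Hypothetical helper: the epoch-{N}_step-{M} directory name mirrors the
    checkpoint layout described in this document.
    """
    save_dir = os.path.join(checkpoint_dir, f"epoch-{epoch}_step-{step}")
    os.makedirs(save_dir, exist_ok=True)
    states = {
        "epoch": epoch,
        "step": step,
        "sample_start_index": sample_start_index,
    }
    path = os.path.join(save_dir, "running_states.json")
    with open(path, "w") as f:
        json.dump(states, f)
    return path
```

On resume, reading this file back tells the data loader where to restart sampling so no batch is skipped or repeated.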
Usage
Use this principle during and after training to persist progress. Save frequency is a trade-off: more frequent checkpoints bound the amount of work lost after a failure, but add I/O overhead and storage cost.
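A common way to realize this trade-off is a fixed step interval plus a forced save at the end of training. A minimal sketch, with the function name and parameters chosen here for illustration:

```python
def should_save(step: int, save_interval: int, is_last_step: bool) -> bool:
    """Decide whether to checkpoint at this step.

    Saves every `save_interval` steps, and always at the final step so the
    finished model is never lost to an unlucky interval boundary.
    """
    return is_last_step or (step > 0 and step % save_interval == 0)
```

All ranks must evaluate this predicate identically (same `step`, same interval) so they agree on when to enter the collective save, which is the consistency requirement discussed below.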
Theoretical Basis
Checkpoint structure:
# Abstract checkpoint layout
checkpoint_dir/
  epoch-{N}_step-{M}/
    modeling/            # Sharded model weights
    optimizer/           # Sharded optimizer states
    lr_scheduler         # LR scheduler state_dict
    running_states.json  # {"epoch": N, "step": M, "sample_start_index": ...}
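The layout above is produced cooperatively: every rank writes its own shard, and one designated rank (conventionally rank 0) writes the shared metadata. The sketch below is a simplified stand-in using JSON shard files named `shard-{rank}` purely for illustration; a real ColossalAI checkpoint stores binary tensor shards through its checkpoint I/O rather than JSON.

```python
import json
import os

def save_checkpoint_shard(checkpoint_dir: str, epoch: int, step: int,
                          rank: int, model_shard: dict, optim_shard: dict) -> str:
    """Each rank persists its own model/optimizer shard; rank 0 also writes
    the shared metadata file. Directory names follow the layout above."""
    save_dir = os.path.join(checkpoint_dir, f"epoch-{epoch}_step-{step}")
    os.makedirs(os.path.join(save_dir, "modeling"), exist_ok=True)
    os.makedirs(os.path.join(save_dir, "optimizer"), exist_ok=True)
    # Per-rank shard files avoid write contention between ranks.
    with open(os.path.join(save_dir, "modeling", f"shard-{rank}.json"), "w") as f:
        json.dump(model_shard, f)
    with open(os.path.join(save_dir, "optimizer", f"shard-{rank}.json"), "w") as f:
        json.dump(optim_shard, f)
    if rank == 0:
        # Only one rank writes shared, rank-independent state.
        with open(os.path.join(save_dir, "running_states.json"), "w") as f:
            json.dump({"epoch": epoch, "step": step}, f)
    return save_dir
```

In a real run, a barrier (e.g. a collective synchronization) after all writes ensures no rank resumes training before the checkpoint is complete on every rank.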
Key considerations:
- Consistency: All ranks must save at the same logical step
- Sharding: Large models are saved in shards to avoid OOM
- Atomicity: Partial saves should not corrupt previous checkpoints
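The atomicity consideration is commonly implemented with a write-then-rename pattern: write to a temporary file in the same directory, then rename it into place, so readers never observe a partial file. A minimal sketch for the metadata file (the same idea extends to whole checkpoint directories):

```python
import json
import os
import tempfile

def atomic_save_json(final_path: str, payload: dict) -> None:
    """Write JSON to a temp file, fsync, then rename into place.

    os.replace is atomic on POSIX when source and destination are on the
    same filesystem, so a crash mid-save leaves the previous file intact.
    """
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, final_path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```

Placing the temp file in the target directory (not the system temp dir) keeps the rename on one filesystem, which is what makes the replace atomic.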