# Principle: Alibaba ROLL Model Checkpointing
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Model_Management |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A model persistence principle for periodically saving training state across distributed workers with lifecycle management and remote upload capabilities.
## Description
Model Checkpointing ensures training progress is preserved by periodically saving model weights, optimizer states, pipeline state, and RNG states to persistent storage. In a distributed multi-cluster setup, checkpointing must coordinate across all worker roles (actor, critic, reference) to produce a consistent snapshot. The principle covers:
- Checkpoint frequency: Saving at configurable step intervals and at the final training step
- Distributed coordination: All checkpoint clusters save their state in parallel
- Pipeline state: Saving training step, metrics history, and random number generator states
- Lifecycle management: Rotating old checkpoints to limit disk usage (max_ckpt_to_keep)
- Remote upload: Uploading checkpoints to remote storage (OSS/S3)
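The coordination and pipeline-state points above can be sketched as a minimal save routine. This is an illustrative sketch, not ROLL's actual implementation: `CheckpointCluster`, `save_checkpoint`, and the JSON file layout are assumptions made for the example; only the role names (actor, critic, reference) and the `cluster.save_state` call come from this page.

```python
import json
import random
import tempfile
from pathlib import Path

class CheckpointCluster:
    """Stand-in for one worker-role cluster (actor, critic, reference)."""
    def __init__(self, role: str):
        self.role = role

    def save_state(self, ckpt_dir: Path) -> None:
        # A real cluster would dump model weights and optimizer state here.
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        (ckpt_dir / f"{self.role}_state.json").write_text(json.dumps({"role": self.role}))

def save_checkpoint(clusters, output_dir: Path, global_step: int, metrics: dict) -> Path:
    """Produce one consistent snapshot across all worker roles."""
    ckpt_dir = output_dir / f"checkpoint-{global_step}"
    for cluster in clusters:  # in ROLL these saves run in parallel
        cluster.save_state(ckpt_dir)
    # Pipeline state: training step, metrics history, RNG state
    pipeline_state = {
        "global_step": global_step,
        "metrics": metrics,
        "rng_state": random.getstate()[1][:3],  # truncated for illustration
    }
    (ckpt_dir / "pipeline_state.json").write_text(json.dumps(pipeline_state))
    return ckpt_dir

out = Path(tempfile.mkdtemp())
clusters = [CheckpointCluster(r) for r in ("actor", "critic", "reference")]
ckpt = save_checkpoint(clusters, out, global_step=100, metrics={"loss": 0.42})
```

The key property is that the snapshot is only consistent if every role saves against the same `global_step`; saving roles against different steps would make the checkpoint unusable for resumption.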
## Usage
Use this principle at the end of each training iteration (or at configured intervals) to save training progress. Checkpointing is shared across all ROLL pipelines.
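A pipeline might expose the knobs mentioned on this page roughly as follows. The key names below are illustrative assumptions (only `save_steps` and `max_ckpt_to_keep` appear on this page; the exact config schema may differ):

```yaml
# Hypothetical checkpoint-related pipeline config (key names illustrative)
checkpoint_config:
  output_dir: /data/ckpt/ppo_run   # local checkpoint root
  save_steps: 50                   # save every 50 training steps and at the last step
  max_ckpt_to_keep: 3              # rotate older checkpoints beyond this count
```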
## Theoretical Basis
Checkpoint structure follows a hierarchical layout:
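A plausible on-disk layout is shown below; the directory and file names are illustrative, not the exact ROLL format:

```
output_dir/
├── checkpoint-50/
│   ├── actor/             # actor model weights + optimizer state
│   ├── critic/
│   ├── reference/
│   └── pipeline_state     # training step, metrics history, RNG states
└── checkpoint-100/
    └── ...
```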
Pseudo-code:

```python
# Abstract checkpointing flow
if global_step % save_steps == 0 or is_last_step:
    for cluster in checkpoint_clusters:
        cluster.save_state(output_dir / f"checkpoint-{global_step}")
    save_pipeline_state(metrics, rng_state)
    upload_to_remote(checkpoint_dir)
    cleanup_old_checkpoints(max_keep=3)
```
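The `cleanup_old_checkpoints` step (the `max_ckpt_to_keep` lifecycle rule) can be sketched as follows. This is a minimal sketch under the assumption that checkpoints live in `checkpoint-<step>` directories; the function signature is hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path

def cleanup_old_checkpoints(output_dir: Path, max_keep: int = 3) -> list:
    """Keep only the max_keep most recent checkpoint-<step> directories."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    ckpts = sorted(
        (d for d in output_dir.iterdir() if d.is_dir() and pattern.match(d.name)),
        key=lambda d: int(pattern.match(d.name).group(1)),
    )
    for stale in ckpts[:-max_keep]:  # everything but the newest max_keep
        shutil.rmtree(stale)
    return [d.name for d in ckpts[-max_keep:]]

# Simulate five checkpoints, then rotate down to three
root = Path(tempfile.mkdtemp())
for step in (50, 100, 150, 200, 250):
    (root / f"checkpoint-{step}").mkdir()
kept = cleanup_old_checkpoints(root, max_keep=3)
```

Sorting numerically by step (rather than lexicographically by name) matters: lexicographic order would rank `checkpoint-100` before `checkpoint-50` and delete the wrong directories.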
## Related Pages
### Implemented By
### Related Heuristics
No specific heuristics inform this principle.