# Principle: Alibaba ROLL Model Checkpointing
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Model_Management |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A model persistence principle for periodically saving training state across distributed workers with lifecycle management and remote upload capabilities.
## Description
Model Checkpointing ensures training progress is preserved by periodically saving model weights, optimizer states, pipeline state, and RNG states to persistent storage. In a distributed multi-cluster setup, checkpointing must coordinate across all worker roles (actor, critic, reference) to produce a consistent snapshot. The principle covers:
- Checkpoint frequency: Saving at configurable step intervals and at the final training step
- Distributed coordination: All checkpoint clusters save their state in parallel
- Pipeline state: Saving training step, metrics history, and random number generator states
- Lifecycle management: Rotating old checkpoints to limit disk usage (max_ckpt_to_keep)
- Remote upload: Uploading checkpoints to remote storage (OSS/S3)
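The coordination and pipeline-state points above can be sketched as a minimal save routine. This is an illustrative sketch, not ROLL's actual implementation: `CheckpointCluster`, `save_checkpoint`, and the JSON file layout are assumptions made for the example; only the role names (actor, critic, reference) and the `cluster.save_state` call come from this page.

```python
import json
import random
import tempfile
from pathlib import Path

class CheckpointCluster:
    """Stand-in for one worker-role cluster (actor, critic, reference)."""
    def __init__(self, role: str):
        self.role = role

    def save_state(self, ckpt_dir: Path) -> None:
        # A real cluster would dump model weights and optimizer state here.
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        (ckpt_dir / f"{self.role}_state.json").write_text(json.dumps({"role": self.role}))

def save_checkpoint(clusters, output_dir: Path, global_step: int, metrics: dict) -> Path:
    """Produce one consistent snapshot across all worker roles."""
    ckpt_dir = output_dir / f"checkpoint-{global_step}"
    for cluster in clusters:  # in ROLL these saves run in parallel
        cluster.save_state(ckpt_dir)
    # Pipeline state: training step, metrics history, RNG state
    pipeline_state = {
        "global_step": global_step,
        "metrics": metrics,
        "rng_state": random.getstate()[1][:3],  # truncated for illustration
    }
    (ckpt_dir / "pipeline_state.json").write_text(json.dumps(pipeline_state))
    return ckpt_dir

out = Path(tempfile.mkdtemp())
clusters = [CheckpointCluster(r) for r in ("actor", "critic", "reference")]
ckpt = save_checkpoint(clusters, out, global_step=100, metrics={"loss": 0.42})
```

The key property is that the snapshot is only consistent if every role saves against the same `global_step`; saving roles against different steps would make the checkpoint unusable for resumption.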
## Usage
Use this principle at the end of each training iteration (or at configured intervals) to save training progress. Checkpointing is shared across all ROLL pipelines.
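A pipeline might expose the knobs mentioned on this page roughly as follows. The key names below are illustrative assumptions (only `save_steps` and `max_ckpt_to_keep` appear on this page; the exact config schema may differ):

```yaml
# Hypothetical checkpoint-related pipeline config (key names illustrative)
checkpoint_config:
  output_dir: /data/ckpt/ppo_run   # local checkpoint root
  save_steps: 50                   # save every 50 training steps and at the last step
  max_ckpt_to_keep: 3              # rotate older checkpoints beyond this count
```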
## Theoretical Basis
Checkpoint structure follows a hierarchical layout:
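A plausible on-disk layout is shown below; the directory and file names are illustrative, not the exact ROLL format:

```
output_dir/
├── checkpoint-50/
│   ├── actor/             # actor model weights + optimizer state
│   ├── critic/
│   ├── reference/
│   └── pipeline_state     # training step, metrics history, RNG states
└── checkpoint-100/
    └── ...
```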
Pseudo-code:

```python
# Abstract checkpointing flow
if global_step % save_steps == 0 or is_last_step:
    for cluster in checkpoint_clusters:
        cluster.save_state(output_dir / f"checkpoint-{global_step}")
    save_pipeline_state(metrics, rng_state)
    upload_to_remote(checkpoint_dir)
    cleanup_old_checkpoints(max_keep=3)
```
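The `cleanup_old_checkpoints` step (the `max_ckpt_to_keep` lifecycle rule) can be sketched as follows. This is a minimal sketch under the assumption that checkpoints live in `checkpoint-<step>` directories; the function signature is hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path

def cleanup_old_checkpoints(output_dir: Path, max_keep: int = 3) -> list:
    """Keep only the max_keep most recent checkpoint-<step> directories."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    ckpts = sorted(
        (d for d in output_dir.iterdir() if d.is_dir() and pattern.match(d.name)),
        key=lambda d: int(pattern.match(d.name).group(1)),
    )
    for stale in ckpts[:-max_keep]:  # everything but the newest max_keep
        shutil.rmtree(stale)
    return [d.name for d in ckpts[-max_keep:]]

# Simulate five checkpoints, then rotate down to three
root = Path(tempfile.mkdtemp())
for step in (50, 100, 150, 200, 250):
    (root / f"checkpoint-{step}").mkdir()
kept = cleanup_old_checkpoints(root, max_keep=3)
```

Sorting numerically by step (rather than lexicographically by name) matters: lexicographic order would rank `checkpoint-100` before `checkpoint-50` and delete the wrong directories.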
## Related Pages
### Implemented By
### Related Heuristics
No specific heuristics inform this principle.