Implementation:Alibaba ROLL BasePipeline Do Checkpoint
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Model_Management |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete checkpoint management method for distributed training pipelines provided by the Alibaba ROLL library.
Description
The BasePipeline.do_checkpoint method coordinates checkpoint saving across all registered clusters, saves pipeline state (metrics, RNG), manages checkpoint lifecycle (rotation, cleanup), and optionally uploads to remote storage. It is the shared checkpointing implementation inherited by all ROLL pipelines (RLVR, Agentic, DPO, SFT, Distill).
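The "coordinates checkpoint saving across all registered clusters" step can be pictured as fanning out a checkpoint call to each cluster and then waiting for all of them before touching pipeline-level state. The sketch below is illustrative only: `FakeCluster` is a hypothetical stand-in for a ROLL cluster handle, and a thread pool stands in for whatever non-blocking dispatch mechanism ROLL actually uses.

```python
from concurrent.futures import ThreadPoolExecutor, Future
from typing import List

class FakeCluster:
    """Hypothetical stand-in for a ROLL cluster; real clusters expose
    their own do_checkpoint entry point."""
    def __init__(self, name: str):
        self.name = name
        self.saved_at = None

    def do_checkpoint(self, global_step: int) -> str:
        self.saved_at = global_step
        return f"{self.name}@{global_step}"

def dispatch_checkpoints(clusters: List[FakeCluster], global_step: int) -> List[str]:
    """Fire do_checkpoint on every cluster without blocking between
    submissions, then wait for all of them to finish before the caller
    goes on to save pipeline-level state."""
    with ThreadPoolExecutor(max_workers=max(1, len(clusters))) as pool:
        futures: List[Future] = [
            pool.submit(c.do_checkpoint, global_step) for c in clusters
        ]
        return [f.result() for f in futures]
```

The key design point is that per-cluster saves overlap with each other, but the pipeline-level state save (metrics, RNG) happens only after every cluster has reported back.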
Usage
This method is called by the pipeline's training loop at configured intervals (save_steps) and at the final training step.
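The trigger condition described above (a save_steps interval or the final step) can be sketched as a small predicate. Names and the exact modulo convention here are illustrative assumptions, not ROLL's actual config keys.

```python
from typing import Optional

def should_checkpoint(global_step: int, save_steps: int,
                      is_last_step: Optional[bool] = None,
                      max_steps: Optional[int] = None) -> bool:
    """Return True when a checkpoint should be written: either the
    configured save_steps interval is reached, or this is the final
    training step. Auto-detects is_last_step when it is None and
    max_steps is known (mirroring the method's optional argument)."""
    if is_last_step is None and max_steps is not None:
        is_last_step = global_step == max_steps - 1  # assumed convention
    if save_steps > 0 and global_step % save_steps == 0:
        return True
    return bool(is_last_step)
```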
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/base_pipeline.py
- Lines: L78-108
Signature
```python
class BasePipeline:
    def do_checkpoint(
        self,
        global_step: int,
        is_last_step: Optional[bool] = None
    ) -> None:
        """
        Save checkpoint if criteria are met.

        Args:
            global_step: Current training step
            is_last_step: Whether this is the final training step

        Process:
            1. Check if save_steps interval reached or is_last_step
            2. Call do_checkpoint on all checkpoint_clusters (non-blocking)
            3. Save pipeline state (metrics, RNG state)
            4. Upload to remote storage
            5. Clean up old checkpoints (max_ckpt_to_keep)
        """
```
Import
```python
from roll.pipeline.base_pipeline import BasePipeline
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| global_step | int | Yes | Current training step number |
| is_last_step | Optional[bool] | No | Whether this is the final step (auto-detected if None) |
Outputs
| Name | Type | Description |
|---|---|---|
| Checkpoint directory | Files | Model weights, optimizer states saved to output_dir/checkpoint-{step}/ |
| Pipeline state | JSON | Metrics history, RNG states saved as JSON/pth files |
Usage Examples
Checkpoint in Training Loop
```python
# Called within the pipeline's run() method:
for step in range(max_steps):
    # ... training logic ...

    # Checkpoint at configured intervals
    self.do_checkpoint(
        global_step=step,
        is_last_step=(step == max_steps - 1)
    )
```
Related Pages
Implements Principle
Requires Environment
Environment Dependencies
This implementation requires the following environment constraints:
Heuristics Applied
No specific heuristics apply to this implementation.