
Implementation:Alibaba ROLL BasePipeline Do Checkpoint

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Model_Management
Last Updated 2026-02-07 20:00 GMT

Overview

A concrete checkpoint-management method for distributed training pipelines, provided by the Alibaba ROLL library.

Description

The BasePipeline.do_checkpoint method coordinates checkpoint saving across all registered clusters, saves pipeline state (metrics, RNG), manages checkpoint lifecycle (rotation, cleanup), and optionally uploads to remote storage. It is the shared checkpointing implementation inherited by all ROLL pipelines (RLVR, Agentic, DPO, SFT, Distill).

Usage

This method is called by the pipeline's training loop at configured intervals (save_steps) and at the final training step.
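The interval criterion can be sketched as a small predicate (a hedged illustration only; the helper name `should_checkpoint` is an assumption for this page, not part of ROLL's API):

```python
def should_checkpoint(global_step: int, save_steps: int, is_last_step: bool) -> bool:
    """Return True when a checkpoint should be written at this step.

    Hypothetical helper mirroring the documented criteria: save at the
    final step, or whenever the save_steps interval is reached.
    """
    if is_last_step:
        return True
    # A non-positive save_steps conventionally disables interval saving.
    if save_steps > 0 and global_step % save_steps == 0:
        return True
    return False
```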

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/base_pipeline.py
  • Lines: L78-108

Signature

class BasePipeline:
    def do_checkpoint(
        self,
        global_step: int,
        is_last_step: Optional[bool] = None
    ) -> None:
        """
        Save checkpoint if criteria are met.

        Args:
            global_step: Current training step
            is_last_step: Whether this is the final training step

        Process:
        1. Check if save_steps interval reached or is_last_step
        2. Call do_checkpoint on all checkpoint_clusters (non-blocking)
        3. Save pipeline state (metrics, RNG state)
        4. Upload to remote storage
        5. Clean up old checkpoints (max_ckpt_to_keep)
        """

Import

from roll.pipeline.base_pipeline import BasePipeline

I/O Contract

Inputs

Name          Type            Required  Description
global_step   int             Yes       Current training step number
is_last_step  Optional[bool]  No        Whether this is the final step (auto-detected if None)

Outputs

Name                  Type   Description
Checkpoint directory  Files  Model weights, optimizer states saved to output_dir/checkpoint-{step}/
Pipeline state        JSON   Metrics history, RNG states saved as JSON/pth files
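Downstream tooling can read the saved pipeline state back from a checkpoint directory. A minimal sketch, assuming the state lives in a pipeline_state.json file (the filename and layout are assumptions for illustration, not documented ROLL artifact names):

```python
import json
import os


def load_pipeline_state(output_dir: str, step: int) -> dict:
    """Read the JSON pipeline state saved under checkpoint-{step}.

    Hypothetical helper: 'pipeline_state.json' is an assumed example
    filename; inspect an actual checkpoint directory for the real layout.
    """
    path = os.path.join(output_dir, f"checkpoint-{step}", "pipeline_state.json")
    with open(path) as f:
        return json.load(f)
```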

Usage Examples

Checkpoint in Training Loop

# Called within the pipeline's run() method:
for step in range(max_steps):
    # ... training logic ...

    # Checkpoint at configured intervals
    self.do_checkpoint(
        global_step=step,
        is_last_step=(step == max_steps - 1)
    )
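Because is_last_step is optional, callers may also pass None and rely on auto-detection. A hedged sketch of how such detection could resolve, assuming the pipeline knows its max_steps (ROLL's actual logic may differ):

```python
def resolve_is_last_step(global_step: int, max_steps: int,
                         is_last_step=None) -> bool:
    """Fall back to comparing against max_steps when the caller passes None.

    Illustrative only: assumes zero-based steps, so the final step is
    max_steps - 1.
    """
    if is_last_step is not None:
        return is_last_step
    return global_step >= max_steps - 1
```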

Related Pages

Implements Principle

Requires Environment

Environment Dependencies

This implementation requires the following environment constraints:

Heuristics Applied

No specific heuristics apply to this implementation.
