Implementation:Alibaba ROLL BasePipeline Do Checkpoint
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Model_Management |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete checkpoint management method for distributed training pipelines provided by the Alibaba ROLL library.
Description
The BasePipeline.do_checkpoint method coordinates checkpoint saving across all registered clusters, saves pipeline state (metrics, RNG), manages checkpoint lifecycle (rotation, cleanup), and optionally uploads to remote storage. It is the shared checkpointing implementation inherited by all ROLL pipelines (RLVR, Agentic, DPO, SFT, Distill).
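The "coordinates checkpoint saving across all registered clusters" step can be pictured as fanning out a checkpoint call to each cluster and then waiting for all of them before touching pipeline-level state. The sketch below is illustrative only: `FakeCluster` is a hypothetical stand-in for a ROLL cluster handle, and a thread pool stands in for whatever non-blocking dispatch mechanism ROLL actually uses.

```python
from concurrent.futures import ThreadPoolExecutor, Future
from typing import List

class FakeCluster:
    """Hypothetical stand-in for a ROLL cluster; real clusters expose
    their own do_checkpoint entry point."""
    def __init__(self, name: str):
        self.name = name
        self.saved_at = None

    def do_checkpoint(self, global_step: int) -> str:
        self.saved_at = global_step
        return f"{self.name}@{global_step}"

def dispatch_checkpoints(clusters: List[FakeCluster], global_step: int) -> List[str]:
    """Fire do_checkpoint on every cluster without blocking between
    submissions, then wait for all of them to finish before the caller
    goes on to save pipeline-level state."""
    with ThreadPoolExecutor(max_workers=max(1, len(clusters))) as pool:
        futures: List[Future] = [
            pool.submit(c.do_checkpoint, global_step) for c in clusters
        ]
        return [f.result() for f in futures]
```

The key design point is that per-cluster saves overlap with each other, but the pipeline-level state save (metrics, RNG) happens only after every cluster has reported back.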
Usage
This method is called by the pipeline's training loop at configured intervals (save_steps) and at the final training step.
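The trigger condition described above (a save_steps interval or the final step) can be sketched as a small predicate. Names and the exact modulo convention here are illustrative assumptions, not ROLL's actual config keys.

```python
from typing import Optional

def should_checkpoint(global_step: int, save_steps: int,
                      is_last_step: Optional[bool] = None,
                      max_steps: Optional[int] = None) -> bool:
    """Return True when a checkpoint should be written: either the
    configured save_steps interval is reached, or this is the final
    training step. Auto-detects is_last_step when it is None and
    max_steps is known (mirroring the method's optional argument)."""
    if is_last_step is None and max_steps is not None:
        is_last_step = global_step == max_steps - 1  # assumed convention
    if save_steps > 0 and global_step % save_steps == 0:
        return True
    return bool(is_last_step)
```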
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/base_pipeline.py
- Lines: L78-108
Signature
```python
class BasePipeline:
    def do_checkpoint(
        self,
        global_step: int,
        is_last_step: Optional[bool] = None
    ) -> None:
        """
        Save checkpoint if criteria are met.

        Args:
            global_step: Current training step
            is_last_step: Whether this is the final training step

        Process:
            1. Check if save_steps interval reached or is_last_step
            2. Call do_checkpoint on all checkpoint_clusters (non-blocking)
            3. Save pipeline state (metrics, RNG state)
            4. Upload to remote storage
            5. Clean up old checkpoints (max_ckpt_to_keep)
        """
```
Import
```python
from roll.pipeline.base_pipeline import BasePipeline
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| global_step | int | Yes | Current training step number |
| is_last_step | Optional[bool] | No | Whether this is the final step (auto-detected if None) |
Outputs
| Name | Type | Description |
|---|---|---|
| Checkpoint directory | Files | Model weights, optimizer states saved to output_dir/checkpoint-{step}/ |
| Pipeline state | JSON | Metrics history, RNG states saved as JSON/pth files |
Usage Examples
Checkpoint in Training Loop
```python
# Called within the pipeline's run() method:
for step in range(max_steps):
    # ... training logic ...

    # Checkpoint at configured intervals
    self.do_checkpoint(
        global_step=step,
        is_last_step=(step == max_steps - 1)
    )
```
Related Pages
Implements Principle
Requires Environment
Environment Dependencies
This implementation requires the following environment constraints:
Heuristics Applied
No specific heuristics apply to this implementation.