
Implementation:Volcengine Verl FSDPSFTTrainer Save Checkpoint

From Leeroopedia


Knowledge Sources: API Doc (verl trainer)
Domains: Checkpointing, Distributed Training, Model Persistence
Last Updated: 2026-02-07

Overview

Description

The FSDPSFTTrainer.save_checkpoint(step) method saves a full training checkpoint at a given global step. It uses the FSDPCheckpointManager to gather sharded FSDP model weights across all ranks and save them in HuggingFace format, making the checkpoint directly loadable with AutoModelForCausalLM.from_pretrained().

The method performs the following steps:

  1. Constructs the checkpoint directory path as {default_local_dir}/global_step_{step}
  2. Delegates to FSDPCheckpointManager.save_checkpoint() to save model weights, optimizer state, and LR scheduler state
  3. On rank 0, saves the StatefulDataLoader state dict to data.pt for resumption
  4. On rank 0, atomically updates a checkpoint tracker file that records the latest step number
  5. Optionally copies the checkpoint to HDFS if config.trainer.default_hdfs_dir is set
  6. Calls torch.distributed.barrier() to synchronize all ranks after saving

The checkpoint manager supports a max_ckpt_to_keep parameter for automatic cleanup of older checkpoints.
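The six steps above can be sketched as a single function. This is an illustrative outline only, not the verl implementation: the function name and arguments are invented for this sketch, and the distributed and serialization calls are shown as comments because they require an initialized process group and a live trainer.

```python
import os

def save_checkpoint_outline(default_local_dir: str, step: int, rank: int) -> str:
    # Step 1: build the checkpoint directory path.
    path = os.path.join(default_local_dir, f"global_step_{step}")
    # Step 2: FSDPCheckpointManager.save_checkpoint(path) runs on all ranks,
    #         gathering sharded FSDP weights plus optimizer/scheduler state.
    if rank == 0:
        os.makedirs(path, exist_ok=True)
        # Step 3: torch.save(train_dataloader.state_dict(),
        #                    os.path.join(path, "data.pt"))
        # Step 4: atomically update the tracker file (temp file + rename).
    # Step 5: optional HDFS copy when default_hdfs_dir is set.
    # Step 6: torch.distributed.barrier() to synchronize all ranks.
    return path
```

Calling the sketch with `default_local_dir="./checkpoints"` and `step=500` yields the path `./checkpoints/global_step_500`, matching the directory layout shown later on this page.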

Usage

Checkpointing is triggered automatically within fit() based on config.trainer.save_freq:

trainer:
  save_freq: 100           # Save every 100 steps
  default_local_dir: ./checkpoints
  default_hdfs_dir: null    # Optional HDFS path
  max_ckpt_to_keep: 3       # Keep only the 3 most recent checkpoints
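A minimal sketch of the kind of cleanup `max_ckpt_to_keep` implies: keep the N most recent `global_step_*` directories under the local checkpoint root and delete the rest. The function name is hypothetical; the actual retention logic lives inside the checkpoint manager.

```python
import os
import re
import shutil

def prune_checkpoints(base_dir: str, max_ckpt_to_keep: int) -> None:
    # Collect the step numbers of all global_step_* directories.
    pat = re.compile(r"global_step_(\d+)$")
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(base_dir)
        if (m := pat.match(name))
    )
    # Delete every checkpoint older than the N most recent ones.
    for step in steps[:-max_ckpt_to_keep]:
        shutil.rmtree(os.path.join(base_dir, f"global_step_{step}"))
```

With `max_ckpt_to_keep: 3` and checkpoints at steps 100, 200, 300, and 400, the sketch removes `global_step_100` and keeps the three newest.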

Code Reference

Source Location: verl/trainer/fsdp_sft_trainer.py, lines 544-585
Signature: def save_checkpoint(self, step: int) -> None
Import: from verl.trainer.fsdp_sft_trainer import FSDPSFTTrainer

I/O Contract

Inputs

step (int): The global training step number for this checkpoint
self.config.trainer.default_local_dir (str): Base directory for local checkpoint storage
self.config.trainer.default_hdfs_dir (str or None): Optional HDFS directory for remote copy
self.config.trainer.max_ckpt_to_keep (int or None): Maximum number of checkpoints to retain
self.config.trainer.save_freq (int): How often, in steps, to trigger a checkpoint save

Outputs

Return value (None): The method returns nothing
{default_local_dir}/global_step_{step}/ (directory): HuggingFace-format model checkpoint (model weights, config, tokenizer)
{default_local_dir}/global_step_{step}/data.pt (file): Serialized StatefulDataLoader state dict for resumption
Tracker file (file): Updated with the latest step number (atomic write via temp file + rename)
HDFS copy (remote files): Optional copy to HDFS when default_hdfs_dir is configured
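The tracker file's atomic update (temp file + rename) can be sketched with the standard library. The helper name is illustrative; the tracker file name matches the `latest_checkpointed_iteration.txt` shown in the directory listing below.

```python
import os

def update_tracker(base_dir: str, step: int) -> None:
    # Write the new step to a temp file first, then rename it over the
    # tracker. os.replace is atomic on POSIX, so a concurrent reader
    # never observes a partially written tracker file.
    tracker = os.path.join(base_dir, "latest_checkpointed_iteration.txt")
    tmp = tracker + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(step))
    os.replace(tmp, tracker)
```

This pattern guarantees that a resuming job either sees the previous step or the new one, never a truncated value.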

Usage Examples

Example 1: Manual checkpoint save

# Within a training loop:
trainer = FSDPSFTTrainer(config=config, ...)
trainer.save_checkpoint(step=500)
# Creates: ./checkpoints/global_step_500/
# Contains: model weights, optimizer state, data.pt, tracker file

Example 2: Checkpoint save triggered inside fit()

# Inside FSDPSFTTrainer.fit():
for epoch in range(start_epoch, self.config.trainer.total_epochs):
    for step_in_epoch, data in enumerate(self.train_dataloader):
        global_step += 1
        metric = self.training_step(data)

        is_save_step = global_step % self.config.trainer.save_freq == 0
        if is_save_step:
            self.save_checkpoint(step=global_step)

Example 3: Checkpoint directory structure

# After saving at step 500 with default_local_dir="./checkpoints":
#
# ./checkpoints/
#   global_step_500/
#     config.json           # HuggingFace model config
#     model.safetensors     # Model weights (gathered from FSDP shards)
#     tokenizer.json        # Tokenizer files
#     data.pt               # DataLoader state for resumption
#   latest_checkpointed_iteration.txt   # Contains "500"

Example 4: Loading a saved checkpoint for inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./checkpoints/global_step_500")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/global_step_500")
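To resume training rather than just run inference, a job first needs to locate the most recent checkpoint. A minimal sketch, assuming the tracker file format described above (a single integer step number); the helper name is hypothetical:

```python
import os

def latest_checkpoint_dir(base_dir: str) -> str:
    # Read the tracker written by save_checkpoint to find the newest step,
    # then rebuild the matching global_step_{step} directory path.
    tracker = os.path.join(base_dir, "latest_checkpointed_iteration.txt")
    with open(tracker) as f:
        step = int(f.read().strip())
    return os.path.join(base_dir, f"global_step_{step}")
```

From the returned directory, a resuming run would reload model weights in HuggingFace format and restore the dataloader position from `data.pt`.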
