
Implementation:Volcengine Verl FSDPSFTTrainer Save Checkpoint

From Leeroopedia


Knowledge Sources: API Doc (verl trainer)
Domains: Checkpointing, Distributed Training, Model Persistence
Last Updated: 2026-02-07

Overview

Description

The FSDPSFTTrainer.save_checkpoint(step) method saves a full training checkpoint at a given global step. It uses the FSDPCheckpointManager to gather sharded FSDP model weights across all ranks and save them in HuggingFace format, making the checkpoint directly loadable with AutoModelForCausalLM.from_pretrained().

The method performs the following steps:

  1. Constructs the checkpoint directory path as {default_local_dir}/global_step_{step}
  2. Delegates to FSDPCheckpointManager.save_checkpoint() to save model weights, optimizer state, and LR scheduler state
  3. On rank 0, saves the StatefulDataLoader state dict to data.pt for resumption
  4. On rank 0, atomically updates a checkpoint tracker file that records the latest step number
  5. Optionally copies the checkpoint to HDFS if config.trainer.default_hdfs_dir is set
  6. Calls torch.distributed.barrier() to synchronize all ranks after saving

The checkpoint manager supports a max_ckpt_to_keep parameter for automatic cleanup of older checkpoints.
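The six steps above can be sketched as a single function. This is an illustrative outline only, not the verl implementation: the function name and arguments are invented for this sketch, and the distributed and serialization calls are shown as comments because they require an initialized process group and a live trainer.

```python
import os

def save_checkpoint_outline(default_local_dir: str, step: int, rank: int) -> str:
    # Step 1: build the checkpoint directory path.
    path = os.path.join(default_local_dir, f"global_step_{step}")
    # Step 2: FSDPCheckpointManager.save_checkpoint(path) runs on all ranks,
    #         gathering sharded FSDP weights plus optimizer/scheduler state.
    if rank == 0:
        os.makedirs(path, exist_ok=True)
        # Step 3: torch.save(train_dataloader.state_dict(),
        #                    os.path.join(path, "data.pt"))
        # Step 4: atomically update the tracker file (temp file + rename).
    # Step 5: optional HDFS copy when default_hdfs_dir is set.
    # Step 6: torch.distributed.barrier() to synchronize all ranks.
    return path
```

Calling the sketch with `default_local_dir="./checkpoints"` and `step=500` yields the path `./checkpoints/global_step_500`, matching the directory layout shown later on this page.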

Usage

Checkpointing is triggered automatically within fit() based on config.trainer.save_freq:

trainer:
  save_freq: 100           # Save every 100 steps
  default_local_dir: ./checkpoints
  default_hdfs_dir: null    # Optional HDFS path
  max_ckpt_to_keep: 3       # Keep only the 3 most recent checkpoints
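A minimal sketch of the kind of cleanup `max_ckpt_to_keep` implies: keep the N most recent `global_step_*` directories under the local checkpoint root and delete the rest. The function name is hypothetical; the actual retention logic lives inside the checkpoint manager.

```python
import os
import re
import shutil

def prune_checkpoints(base_dir: str, max_ckpt_to_keep: int) -> None:
    # Collect the step numbers of all global_step_* directories.
    pat = re.compile(r"global_step_(\d+)$")
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(base_dir)
        if (m := pat.match(name))
    )
    # Delete every checkpoint older than the N most recent ones.
    for step in steps[:-max_ckpt_to_keep]:
        shutil.rmtree(os.path.join(base_dir, f"global_step_{step}"))
```

With `max_ckpt_to_keep: 3` and checkpoints at steps 100, 200, 300, and 400, the sketch removes `global_step_100` and keeps the three newest.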

Code Reference

Source Location: verl/trainer/fsdp_sft_trainer.py, lines 544-585
Signature: def save_checkpoint(self, step: int) -> None
Import: from verl.trainer.fsdp_sft_trainer import FSDPSFTTrainer

I/O Contract

Inputs

step (int): The global training step number for this checkpoint
self.config.trainer.default_local_dir (str): Base directory for local checkpoint storage
self.config.trainer.default_hdfs_dir (str or None): Optional HDFS directory for remote copy
self.config.trainer.max_ckpt_to_keep (int or None): Maximum number of checkpoints to retain
self.config.trainer.save_freq (int): How often, in steps, to trigger a checkpoint save

Outputs

Return value (None): The method returns nothing
{default_local_dir}/global_step_{step}/ (directory): HuggingFace-format model checkpoint (model weights, config, tokenizer)
{default_local_dir}/global_step_{step}/data.pt (file): Serialized StatefulDataLoader state dict for resumption
Tracker file (file): Updated with the latest step number (atomic write via temp file + rename)
HDFS copy (remote files): Optional copy to HDFS when default_hdfs_dir is configured
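The tracker file's atomic update (temp file + rename) can be sketched with the standard library. The helper name is illustrative; the tracker file name matches the `latest_checkpointed_iteration.txt` shown in the directory listing below.

```python
import os

def update_tracker(base_dir: str, step: int) -> None:
    # Write the new step to a temp file first, then rename it over the
    # tracker. os.replace is atomic on POSIX, so a concurrent reader
    # never observes a partially written tracker file.
    tracker = os.path.join(base_dir, "latest_checkpointed_iteration.txt")
    tmp = tracker + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(step))
    os.replace(tmp, tracker)
```

This pattern guarantees that a resuming job either sees the previous step or the new one, never a truncated value.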

Usage Examples

Example 1: Manual checkpoint save

# Within a training loop:
trainer = FSDPSFTTrainer(config=config, ...)
trainer.save_checkpoint(step=500)
# Creates: ./checkpoints/global_step_500/
# Contains: model weights, optimizer state, data.pt, tracker file

Example 2: Checkpoint save triggered inside fit()

# Inside FSDPSFTTrainer.fit():
for epoch in range(start_epoch, self.config.trainer.total_epochs):
    for step_in_epoch, data in enumerate(self.train_dataloader):
        global_step += 1
        metric = self.training_step(data)

        is_save_step = global_step % self.config.trainer.save_freq == 0
        if is_save_step:
            self.save_checkpoint(step=global_step)

Example 3: Checkpoint directory structure

# After saving at step 500 with default_local_dir="./checkpoints":
#
# ./checkpoints/
#   global_step_500/
#     config.json           # HuggingFace model config
#     model.safetensors     # Model weights (gathered from FSDP shards)
#     tokenizer.json        # Tokenizer files
#     data.pt               # DataLoader state for resumption
#   latest_checkpointed_iteration.txt   # Contains "500"

Example 4: Loading a saved checkpoint for inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./checkpoints/global_step_500")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/global_step_500")
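To resume training rather than just run inference, a job first needs to locate the most recent checkpoint. A minimal sketch, assuming the tracker file format described above (a single integer step number); the helper name is hypothetical:

```python
import os

def latest_checkpoint_dir(base_dir: str) -> str:
    # Read the tracker written by save_checkpoint to find the newest step,
    # then rebuild the matching global_step_{step} directory path.
    tracker = os.path.join(base_dir, "latest_checkpointed_iteration.txt")
    with open(tracker) as f:
        step = int(f.read().strip())
    return os.path.join(base_dir, f"global_step_{step}")
```

From the returned directory, a resuming run would reload model weights in HuggingFace format and restore the dataloader position from `data.pt`.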
