Implementation:Volcengine Verl FSDPSFTTrainer Save Checkpoint
| Field | Value |
|---|---|
| Knowledge Sources | API Doc (verl trainer) |
| Domains | Checkpointing, Distributed Training, Model Persistence |
| Last Updated | 2026-02-07 |
Overview
Description
The `FSDPSFTTrainer.save_checkpoint(step)` method saves a full training checkpoint at a given global step. It uses the `FSDPCheckpointManager` to gather sharded FSDP model weights across all ranks and save them in HuggingFace format, making the checkpoint directly loadable with `AutoModelForCausalLM.from_pretrained()`.
The method performs the following steps:
- Constructs the checkpoint directory path as `{default_local_dir}/global_step_{step}`
- Delegates to `FSDPCheckpointManager.save_checkpoint()` to save model weights, optimizer state, and LR scheduler state
- On rank 0, saves the `StatefulDataLoader` state dict to `data.pt` for resumption
- On rank 0, atomically updates a checkpoint tracker file that records the latest step number
- Optionally copies the checkpoint to HDFS if `config.trainer.default_hdfs_dir` is set
- Calls `torch.distributed.barrier()` to synchronize all ranks after saving
The checkpoint manager supports a `max_ckpt_to_keep` parameter for automatic cleanup of older checkpoints.
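The steps above can be sketched in simplified, single-process form. This is a hedged illustration only: it omits the FSDP weight gathering and `torch.distributed` synchronization, and the function name and file contents are assumptions, not the verl implementation.

```python
import os
import tempfile


def save_checkpoint_sketch(default_local_dir: str, step: int) -> str:
    """Illustrative stand-in for the save flow described above."""
    # 1. Construct {default_local_dir}/global_step_{step}
    ckpt_dir = os.path.join(default_local_dir, f"global_step_{step}")
    os.makedirs(ckpt_dir, exist_ok=True)

    # 2. (rank 0) stand-in for saving the dataloader state dict to data.pt;
    #    the real code would call torch.save(dataloader.state_dict(), ...)
    with open(os.path.join(ckpt_dir, "data.pt"), "wb") as f:
        f.write(b"")

    # 3. (rank 0) record the latest step in a tracker file at the base dir
    tracker = os.path.join(default_local_dir, "latest_checkpointed_iteration.txt")
    with open(tracker, "w") as f:
        f.write(str(step))

    # 4. In distributed training, torch.distributed.barrier() would go here.
    return ckpt_dir


demo_dir = tempfile.mkdtemp()
ckpt_dir = save_checkpoint_sketch(demo_dir, 500)
```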
Usage
Checkpointing is triggered automatically within `fit()` based on `config.trainer.save_freq`:

```yaml
trainer:
  save_freq: 100                   # Save every 100 steps
  default_local_dir: ./checkpoints
  default_hdfs_dir: null           # Optional HDFS path
  max_ckpt_to_keep: 3              # Keep only the 3 most recent checkpoints
```
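A `max_ckpt_to_keep`-style rotation can be sketched as follows. This is a hedged illustration: the real cleanup lives inside `FSDPCheckpointManager`, so the function name and exact policy here are assumptions.

```python
import os
import re
import shutil


def prune_checkpoints(base_dir: str, max_ckpt_to_keep: int) -> list[str]:
    """Delete all but the newest `max_ckpt_to_keep` global_step_* directories."""
    pattern = re.compile(r"global_step_(\d+)$")
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(base_dir)
        if (m := pattern.match(name))
    )
    removed = []
    # Everything except the last N steps (the newest checkpoints) is dropped.
    for step in steps[:-max_ckpt_to_keep]:
        path = os.path.join(base_dir, f"global_step_{step}")
        shutil.rmtree(path)
        removed.append(path)
    return removed
```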
Code Reference
| Attribute | Detail |
|---|---|
| Source Location | `verl/trainer/fsdp_sft_trainer.py`, Lines 544-585 |
| Signature | `def save_checkpoint(self, step: int) -> None` |
| Import | `from verl.trainer.fsdp_sft_trainer import FSDPSFTTrainer` |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| `step` | `int` | The global training step number for this checkpoint |
| `self.config.trainer.default_local_dir` | `str` | Base directory for local checkpoint storage |
| `self.config.trainer.default_hdfs_dir` | `str` or `None` | Optional HDFS directory for remote copy |
| `self.config.trainer.max_ckpt_to_keep` | `int` or `None` | Maximum number of checkpoints to retain |
| `self.config.trainer.save_freq` | `int` | How often (in steps) to trigger checkpoint saves |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | `None` | Method returns nothing |
| `{default_local_dir}/global_step_{step}/` | Directory | HuggingFace-format model checkpoint (model weights, config, tokenizer) |
| `{default_local_dir}/global_step_{step}/data.pt` | File | Serialized `StatefulDataLoader` state dict for resumption |
| Tracker file | File | Updated with the latest step number (atomic write via temp file + rename) |
| HDFS copy | Remote files | Optional copy to HDFS if `default_hdfs_dir` is configured |
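The "temp file + rename" tracker update can be sketched like this. The filename and function name are assumptions; the atomicity itself comes from `os.replace`, which is an atomic rename on POSIX filesystems.

```python
import os
import tempfile


def write_tracker_atomic(base_dir: str, step: int) -> None:
    """Write the latest step to the tracker file without exposing partial writes."""
    # Write the new value to a temp file in the same directory first,
    # so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=base_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(str(step))
    # Atomic rename: readers see either the old tracker or the new one,
    # never a half-written file.
    os.replace(tmp_path, os.path.join(base_dir, "latest_checkpointed_iteration.txt"))
```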
Usage Examples
Example 1: Manual checkpoint save
```python
# Within a training loop:
trainer = FSDPSFTTrainer(config=config, ...)
trainer.save_checkpoint(step=500)

# Creates:  ./checkpoints/global_step_500/
# Contains: model weights, optimizer state, data.pt, tracker file
```
Example 2: Checkpoint save triggered inside fit()
```python
# Inside FSDPSFTTrainer.fit():
for epoch in range(start_epoch, self.config.trainer.total_epochs):
    for step_in_epoch, data in enumerate(self.train_dataloader):
        global_step += 1
        metric = self.training_step(data)

        is_save_step = global_step % self.config.trainer.save_freq == 0
        if is_save_step:
            self.save_checkpoint(step=global_step)
```
Example 3: Checkpoint directory structure
```text
# After saving at step 500 with default_local_dir="./checkpoints":
#
# ./checkpoints/
#   global_step_500/
#     config.json                        # HuggingFace model config
#     model.safetensors                  # Model weights (gathered from FSDP shards)
#     tokenizer.json                     # Tokenizer files
#     data.pt                            # DataLoader state for resumption
#   latest_checkpointed_iteration.txt    # Contains "500"
```
Example 4: Loading a saved checkpoint for inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./checkpoints/global_step_500")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/global_step_500")
```
Related Pages
- Principle:Volcengine_Verl_SFT_Checkpointing
- verl/trainer/fsdp_sft_trainer.py -- Source file
- verl/utils/checkpoint/fsdp_checkpoint_manager.py -- FSDPCheckpointManager implementation
- Implementation:Volcengine_Verl_FSDPSFTTrainer_Fit -- Training loop that triggers checkpoint saves