Overview
Concrete tools for periodic validation and checkpoint saving within verl's distributed PPO trainer.
Description
The RayPPOTrainer._validate() method runs validation by generating responses on the validation dataset, computing rewards with the configured reward function (and, optionally, the reward model), and logging detailed metrics, including per-data-source scores, sample outputs, and reward breakdowns. It pads the validation data so the batch size is divisible by the data-parallel world size, generates sequences with the actor rollout worker, optionally invokes a colocated reward model, and aggregates the results for logging.
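To make the padding step concrete, here is a minimal sketch of rounding a batch up to a multiple of the data-parallel world size. This is a generic illustration; pad_to_divisor and the reuse-leading-items filler strategy are assumptions, not verl's actual helper.

# Minimal sketch of padding a batch so its length divides evenly across
# data-parallel ranks; verl's real helper may differ in detail.
def pad_to_divisor(items, dp_size):
    """Append leading items as filler until len(items) % dp_size == 0.

    Returns the padded list and the pad count so the filler rows can be
    stripped from the generated sequences afterwards.
    """
    pad_size = (dp_size - len(items) % dp_size) % dp_size
    return items + items[:pad_size], pad_size

batch, pad_size = pad_to_divisor(list(range(10)), dp_size=4)
assert len(batch) % 4 == 0 and pad_size == 2  # 10 -> 12 items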
The RayPPOTrainer._save_checkpoint() method persists actor (and optionally critic) model weights to local storage (and optionally HDFS). It creates a directory structure of default_local_dir/global_step_{N}/actor (and .../critic), saves the dataloader state for resumption, and writes a latest_checkpointed_iteration.txt file for atomic checkpoint tracking. It supports configurable maximum checkpoint retention via max_actor_ckpt_to_keep and max_critic_ckpt_to_keep.
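The tracker file is what makes checkpoint tracking safe: it is written only after the global_step_{N} directory is fully populated, so a reader never observes a half-written step. A minimal sketch of that ordering, assuming a plain text file and a temp-file rename (the helper name is hypothetical):

import os

def write_checkpoint_tracker(local_dir, global_step):
    """Record the newest complete checkpoint step.

    Written last, after global_step_{N}/ is fully populated, so resume
    logic that reads the tracker never sees a partial checkpoint.
    """
    tracker = os.path.join(local_dir, "latest_checkpointed_iteration.txt")
    tmp = tracker + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(global_step))
    os.replace(tmp, tracker)  # atomic rename on POSIX filesystems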
Usage
These methods are called automatically in the training loop based on configuration:
- _validate() is called every trainer.test_freq steps
- _save_checkpoint() is called every trainer.save_freq steps
- Training runs for trainer.total_epochs epochs in total
- Checkpoints are saved to trainer.default_local_dir
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/ray_trainer.py
- _validate() starts at line: 544
- _save_checkpoint() starts at line: 920
Signature
class RayPPOTrainer:
    """Distributed PPO trainer using Ray for scalable reinforcement learning."""

    def _validate(self, merged: bool = False):
        """
        Run validation on the validation dataset.

        Generates responses using the actor rollout worker group, computes
        rewards, and logs detailed metrics per data source. Supports both
        standard and async rollout modes.

        Args:
            merged: Whether to use merged model weights for validation.

        Returns:
            dict: Validation metrics including per-source reward statistics.
        """

    def _save_checkpoint(self):
        """
        Save model checkpoint to local storage (and optionally HDFS).

        Creates directory: default_local_dir/global_step_{N}/actor (and /critic).
        Saves dataloader state for resumption. Writes
        latest_checkpointed_iteration.txt for atomic tracking.
        Supports configurable max checkpoint retention.
        """
Import
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
I/O Contract
Inputs (_validate)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| merged | bool | No | Whether to use merged model weights (default: False) |
| (implicit) self.val_dataloader | DataLoader | Yes | Validation data loader set during trainer initialization |
| (implicit) self.actor_rollout_wg | RayWorkerGroup | Yes | Actor rollout worker group for sequence generation |
Outputs (_validate)
| Name | Type | Description |
| --- | --- | --- |
| val_metrics | dict | Dictionary of validation metrics keyed by data source and metric name |
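A caller can consume the returned dictionary directly; the exact key scheme (e.g. a per-data-source prefix) is an assumption based on the description above, and trainer stands for an initialized RayPPOTrainer:

val_metrics = trainer._validate()
# Keys combine metric name and data source (exact scheme is an assumption).
for name, value in sorted(val_metrics.items()):
    print(f"{name}: {value}")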
Inputs (_save_checkpoint)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| (implicit) self.global_steps | int | Yes | Current global training step number |
| (implicit) self.config.trainer.default_local_dir | str | Yes | Base directory for saving checkpoints |
| (implicit) self.config.trainer.default_hdfs_dir | Optional[str] | No | Optional HDFS directory for remote checkpoints |
Outputs (_save_checkpoint)
| Name | Type | Description |
| --- | --- | --- |
| (side effect) | files | Actor and critic model weights, dataloader state, and iteration tracker written to disk |
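For resumption, a loader can read the tracker file to locate the newest complete checkpoint. A minimal sketch under the directory layout described above (the helper itself is hypothetical, not part of verl's API):

import os

def find_latest_checkpoint(local_dir):
    """Return the newest complete checkpoint directory, or None."""
    tracker = os.path.join(local_dir, "latest_checkpointed_iteration.txt")
    if not os.path.exists(tracker):
        return None
    with open(tracker) as f:
        step = int(f.read().strip())
    return os.path.join(local_dir, f"global_step_{step}")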
Configuration Keys
| Config Key | Type | Description |
| --- | --- | --- |
| trainer.test_freq | int | How often (in steps) to run validation |
| trainer.save_freq | int | How often (in steps) to save checkpoints |
| trainer.total_epochs | int | Total number of training epochs |
| trainer.default_local_dir | str | Local directory for checkpoint storage |
| trainer.default_hdfs_dir | Optional[str] | Optional HDFS path for remote checkpoint storage |
| trainer.max_actor_ckpt_to_keep | Optional[int] | Maximum number of actor checkpoints to retain |
| trainer.max_critic_ckpt_to_keep | Optional[int] | Maximum number of critic checkpoints to retain |
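Retention prunes the oldest global_step_{N} directories once the configured limit is exceeded. A sketch of that policy (a hypothetical helper, not verl's internal implementation):

import os
import re
import shutil

def prune_old_checkpoints(local_dir, max_to_keep):
    """Delete the oldest global_step_{N} directories beyond max_to_keep."""
    if not max_to_keep or max_to_keep < 1:
        return  # retention disabled
    pattern = re.compile(r"global_step_(\d+)$")
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(local_dir)
        if (m := pattern.match(name))
    )
    for step in steps[:-max_to_keep]:  # keep only the newest max_to_keep
        shutil.rmtree(os.path.join(local_dir, f"global_step_{step}"))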
Usage Examples
# Configuration (YAML)
# trainer:
# total_epochs: 3
# test_freq: 50
# save_freq: 100
# default_local_dir: /mnt/checkpoints/my_experiment
# max_actor_ckpt_to_keep: 3
# max_critic_ckpt_to_keep: 2
# The training loop calls these methods automatically:
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
# During the training loop (simplified):
# for epoch in range(trainer.config.trainer.total_epochs):
# for batch in trainer.train_dataloader:
# trainer.global_steps += 1
#
# # ... training step ...
#
# # Periodic validation
# if trainer.global_steps % trainer.config.trainer.test_freq == 0:
# val_metrics = trainer._validate()
# # val_metrics contains per-source reward means, stds, etc.
#
# # Periodic checkpoint saving
# if trainer.global_steps % trainer.config.trainer.save_freq == 0:
# trainer._save_checkpoint()
# # Saves to: /mnt/checkpoints/my_experiment/global_step_100/actor/
# # /mnt/checkpoints/my_experiment/global_step_100/critic/
# # /mnt/checkpoints/my_experiment/global_step_100/data.pt
# # /mnt/checkpoints/my_experiment/latest_checkpointed_iteration.txt