Overview
Concrete tools for periodic validation and checkpoint saving within verl's distributed PPO trainer.
Description
The RayPPOTrainer._validate() method runs validation by generating responses on the validation dataset, computing rewards with the configured reward function (and, optionally, the reward model), and logging detailed metrics, including per-data-source scores, sample outputs, and reward breakdowns. It pads the validation data so the batch size is divisible by the data-parallel world size, generates sequences with the actor rollout worker, optionally invokes a colocated reward model, and aggregates the results for logging.
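To make the padding step concrete, here is a minimal sketch of rounding a batch up to a multiple of the data-parallel world size. This is a generic illustration; pad_to_divisor and the reuse-leading-items filler strategy are assumptions, not verl's actual helper.

# Minimal sketch of padding a batch so its length divides evenly across
# data-parallel ranks; verl's real helper may differ in detail.
def pad_to_divisor(items, dp_size):
    """Append leading items as filler until len(items) % dp_size == 0.

    Returns the padded list and the pad count so the filler rows can be
    stripped from the generated sequences afterwards.
    """
    pad_size = (dp_size - len(items) % dp_size) % dp_size
    return items + items[:pad_size], pad_size

batch, pad_size = pad_to_divisor(list(range(10)), dp_size=4)
assert len(batch) % 4 == 0 and pad_size == 2  # 10 -> 12 items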
The RayPPOTrainer._save_checkpoint() method persists actor (and optionally critic) model weights to local storage (and optionally HDFS). It creates a directory structure of default_local_dir/global_step_{N}/actor (and .../critic), saves the dataloader state for resumption, and writes a latest_checkpointed_iteration.txt file for atomic checkpoint tracking. It supports configurable maximum checkpoint retention via max_actor_ckpt_to_keep and max_critic_ckpt_to_keep.
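The tracker file is what makes checkpoint tracking safe: it is written only after the global_step_{N} directory is fully populated, so a reader never observes a half-written step. A minimal sketch of that ordering, assuming a plain text file and a temp-file rename (the helper name is hypothetical):

import os

def write_checkpoint_tracker(local_dir, global_step):
    """Record the newest complete checkpoint step.

    Written last, after global_step_{N}/ is fully populated, so resume
    logic that reads the tracker never sees a partial checkpoint.
    """
    tracker = os.path.join(local_dir, "latest_checkpointed_iteration.txt")
    tmp = tracker + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(global_step))
    os.replace(tmp, tracker)  # atomic rename on POSIX filesystems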
Usage
These methods are called automatically in the training loop based on configuration:
- _validate() is called every trainer.test_freq steps
- _save_checkpoint() is called every trainer.save_freq steps
- Training runs for trainer.total_epochs epochs in total
- Checkpoints are saved to trainer.default_local_dir
Code Reference
Source Location
- Repository: verl
- File: verl/trainer/ppo/ray_trainer.py
- _validate() starts at line: 544
- _save_checkpoint() starts at line: 920
Signature
class RayPPOTrainer:
    """Distributed PPO trainer using Ray for scalable reinforcement learning."""

    def _validate(self, merged: bool = False):
        """
        Run validation on the validation dataset.

        Generates responses using the actor rollout worker group, computes
        rewards, and logs detailed metrics per data source. Supports both
        standard and async rollout modes.

        Args:
            merged: Whether to use merged model weights for validation.

        Returns:
            dict: Validation metrics including per-source reward statistics.
        """

    def _save_checkpoint(self):
        """
        Save model checkpoint to local storage (and optionally HDFS).

        Creates directory: default_local_dir/global_step_{N}/actor (and /critic).
        Saves dataloader state for resumption. Writes
        latest_checkpointed_iteration.txt for atomic tracking.
        Supports configurable max checkpoint retention.
        """
Import
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
I/O Contract
Inputs (_validate)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| merged | bool | No | Whether to use merged model weights (default: False) |
| (implicit) self.val_dataloader | DataLoader | Yes | Validation data loader set during trainer initialization |
| (implicit) self.actor_rollout_wg | RayWorkerGroup | Yes | Actor rollout worker group for sequence generation |
Outputs (_validate)
| Name | Type | Description |
| --- | --- | --- |
| val_metrics | dict | Dictionary of validation metrics keyed by data source and metric name |
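A caller can consume the returned dictionary directly; the exact key scheme (e.g. a per-data-source prefix) is an assumption based on the description above, and trainer stands for an initialized RayPPOTrainer:

val_metrics = trainer._validate()
# Keys combine metric name and data source (exact scheme is an assumption).
for name, value in sorted(val_metrics.items()):
    print(f"{name}: {value}")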
Inputs (_save_checkpoint)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| (implicit) self.global_steps | int | Yes | Current global training step number |
| (implicit) self.config.trainer.default_local_dir | str | Yes | Base directory for saving checkpoints |
| (implicit) self.config.trainer.default_hdfs_dir | Optional[str] | No | Optional HDFS directory for remote checkpoints |
Outputs (_save_checkpoint)
| Name | Type | Description |
| --- | --- | --- |
| (side effect) | files | Actor and critic model weights, dataloader state, and iteration tracker written to disk |
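For resumption, a loader can read the tracker file to locate the newest complete checkpoint. A minimal sketch under the directory layout described above (the helper itself is hypothetical, not part of verl's API):

import os

def find_latest_checkpoint(local_dir):
    """Return the newest complete checkpoint directory, or None."""
    tracker = os.path.join(local_dir, "latest_checkpointed_iteration.txt")
    if not os.path.exists(tracker):
        return None
    with open(tracker) as f:
        step = int(f.read().strip())
    return os.path.join(local_dir, f"global_step_{step}")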
Configuration Keys
| Config Key | Type | Description |
| --- | --- | --- |
| trainer.test_freq | int | How often (in steps) to run validation |
| trainer.save_freq | int | How often (in steps) to save checkpoints |
| trainer.total_epochs | int | Total number of training epochs |
| trainer.default_local_dir | str | Local directory for checkpoint storage |
| trainer.default_hdfs_dir | Optional[str] | Optional HDFS path for remote checkpoint storage |
| trainer.max_actor_ckpt_to_keep | Optional[int] | Maximum number of actor checkpoints to retain |
| trainer.max_critic_ckpt_to_keep | Optional[int] | Maximum number of critic checkpoints to retain |
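Retention prunes the oldest global_step_{N} directories once the configured limit is exceeded. A sketch of that policy (a hypothetical helper, not verl's internal implementation):

import os
import re
import shutil

def prune_old_checkpoints(local_dir, max_to_keep):
    """Delete the oldest global_step_{N} directories beyond max_to_keep."""
    if not max_to_keep or max_to_keep < 1:
        return  # retention disabled
    pattern = re.compile(r"global_step_(\d+)$")
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(local_dir)
        if (m := pattern.match(name))
    )
    for step in steps[:-max_to_keep]:  # keep only the newest max_to_keep
        shutil.rmtree(os.path.join(local_dir, f"global_step_{step}"))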
Usage Examples
# Configuration (YAML)
# trainer:
# total_epochs: 3
# test_freq: 50
# save_freq: 100
# default_local_dir: /mnt/checkpoints/my_experiment
# max_actor_ckpt_to_keep: 3
# max_critic_ckpt_to_keep: 2
# The training loop calls these methods automatically:
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
# During the training loop (simplified):
# for epoch in range(trainer.config.trainer.total_epochs):
# for batch in trainer.train_dataloader:
# trainer.global_steps += 1
#
# # ... training step ...
#
# # Periodic validation
# if trainer.global_steps % trainer.config.trainer.test_freq == 0:
# val_metrics = trainer._validate()
# # val_metrics contains per-source reward means, stds, etc.
#
# # Periodic checkpoint saving
# if trainer.global_steps % trainer.config.trainer.save_freq == 0:
# trainer._save_checkpoint()
# # Saves to: /mnt/checkpoints/my_experiment/global_step_100/actor/
# # /mnt/checkpoints/my_experiment/global_step_100/critic/
# # /mnt/checkpoints/my_experiment/global_step_100/data.pt
# # /mnt/checkpoints/my_experiment/latest_checkpointed_iteration.txt