
Implementation:Volcengine Verl RayPPOTrainer Validate Save

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Training_Infrastructure, Checkpointing
Last Updated 2026-02-07 14:00 GMT

Overview

Concrete tools for periodic validation and checkpoint saving within the distributed PPO trainer, provided by the verl library.

Description

The RayPPOTrainer._validate() method runs validation by generating responses on the validation dataset, computing rewards using the configured reward function (and optionally the reward model), and logging detailed metrics including per-data-source scores, sample outputs, and reward breakdowns. It pads data to be divisible by the data-parallel size, generates sequences using the actor rollout worker, optionally invokes a colocated reward model, and aggregates results for logging.
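The padding step mentioned above can be sketched as follows. This is an illustrative helper, not verl's actual API; the function name and signature are hypothetical:

```python
# Minimal sketch of padding a batch so its length divides the data-parallel
# world size; name and signature are illustrative only.
def pad_to_divisible(samples, dp_world_size, pad_sample):
    """Append copies of a filler sample until len(samples) % dp_world_size == 0."""
    remainder = len(samples) % dp_world_size
    pad_count = (dp_world_size - remainder) % dp_world_size
    return samples + [pad_sample] * pad_count, pad_count

# 10 validation samples across 4 data-parallel ranks: 2 filler samples are
# appended so each rank receives 3; the padded entries are discarded from the
# aggregated metrics after generation.
padded, n_pad = pad_to_divisible(list(range(10)), 4, None)
```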

The RayPPOTrainer._save_checkpoint() method persists actor (and optionally critic) model weights to local storage (and optionally HDFS). It creates a directory structure of default_local_dir/global_step_{N}/actor (and .../critic), saves the dataloader state for resumption, and writes a latest_checkpointed_iteration.txt file for atomic checkpoint tracking. It supports configurable maximum checkpoint retention via max_actor_ckpt_to_keep and max_critic_ckpt_to_keep.
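The directory layout and the tracker-file update described above can be sketched in a few lines. This is a hand-written illustration of the pattern, assuming a POSIX filesystem; the helper name is hypothetical and not part of verl:

```python
import os
import tempfile

# Illustrative sketch of the checkpoint layout and tracker update described
# above; write_checkpoint_tracker is a hypothetical name, not verl's API.
def write_checkpoint_tracker(local_dir, global_step):
    step_dir = os.path.join(local_dir, f"global_step_{global_step}")
    os.makedirs(os.path.join(step_dir, "actor"), exist_ok=True)
    # Write to a temp file, then rename: on POSIX, rename within a filesystem
    # is atomic, so the tracker never points at a half-written value.
    tracker = os.path.join(local_dir, "latest_checkpointed_iteration.txt")
    fd, tmp = tempfile.mkstemp(dir=local_dir)
    with os.fdopen(fd, "w") as f:
        f.write(str(global_step))
    os.replace(tmp, tracker)
    return tracker
```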

Usage

These methods are called automatically in the training loop based on configuration:

  • _validate() is called every trainer.test_freq steps
  • _save_checkpoint() is called every trainer.save_freq steps
  • The total training runs for trainer.total_epochs epochs
  • Checkpoints are saved to trainer.default_local_dir
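The step-based triggers above reduce to a simple modulo check. A minimal sketch (hand-written, not verl's actual loop code):

```python
def due(step, freq):
    """A step-based trigger fires when freq is positive and divides step."""
    return freq is not None and freq > 0 and step % freq == 0

# With test_freq=50 and save_freq=100, step 100 triggers both validation
# and checkpoint saving; step 150 triggers validation only.
```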

Code Reference

Source Location

  • Repository: verl
  • File: verl/trainer/ppo/ray_trainer.py
  • _validate() starts at line: 544
  • _save_checkpoint() starts at line: 920

Signature

class RayPPOTrainer:
    """Distributed PPO trainer using Ray for scalable reinforcement learning."""

    def _validate(self, merged: bool = False):
        """
        Run validation on the validation dataset.

        Generates responses using the actor rollout worker group, computes
        rewards, and logs detailed metrics per data source. Supports both
        standard and async rollout modes.

        Args:
            merged: Whether to use merged model weights for validation.

        Returns:
            dict: Validation metrics including per-source reward statistics.
        """

    def _save_checkpoint(self):
        """
        Save model checkpoint to local storage (and optionally HDFS).

        Creates directory: default_local_dir/global_step_{N}/actor (and /critic).
        Saves dataloader state for resumption. Writes
        latest_checkpointed_iteration.txt for atomic tracking.
        Supports configurable max checkpoint retention.
        """

Import

from verl.trainer.ppo.ray_trainer import RayPPOTrainer

I/O Contract

Inputs (_validate)

  • merged (bool, optional): Whether to use merged model weights (default: False)
  • self.val_dataloader (DataLoader, implicit, required): Validation data loader set during trainer initialization
  • self.actor_rollout_wg (RayWorkerGroup, implicit, required): Actor rollout worker group for sequence generation

Outputs (_validate)

  • val_metrics (dict): Dictionary of validation metrics keyed by data source and metric name

Inputs (_save_checkpoint)

  • self.global_steps (int, implicit, required): Current global training step number
  • self.config.trainer.default_local_dir (str, implicit, required): Base directory for saving checkpoints
  • self.config.trainer.default_hdfs_dir (Optional[str], implicit): Optional HDFS directory for remote checkpoints

Outputs (_save_checkpoint)

  • files (side effect): Actor and critic model weights, dataloader state, and iteration tracker written to disk

Configuration Keys

  • trainer.test_freq (int): How often (in steps) to run validation
  • trainer.save_freq (int): How often (in steps) to save checkpoints
  • trainer.total_epochs (int): Total number of training epochs
  • trainer.default_local_dir (str): Local directory for checkpoint storage
  • trainer.default_hdfs_dir (Optional[str]): Optional HDFS path for remote checkpoint storage
  • trainer.max_actor_ckpt_to_keep (Optional[int]): Maximum number of actor checkpoints to retain
  • trainer.max_critic_ckpt_to_keep (Optional[int]): Maximum number of critic checkpoints to retain
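The retention keys cap how many checkpoint directories survive on disk. A minimal sketch of the pruning decision, assuming a None limit means "keep everything" (the helper name is hypothetical, not verl's API):

```python
def checkpoints_to_remove(step_dirs, max_keep):
    """Return the oldest checkpoint dirs beyond the retention limit.

    step_dirs are names like 'global_step_100'. max_keep=None keeps all,
    mirroring max_actor_ckpt_to_keep / max_critic_ckpt_to_keep semantics
    as described above. Illustrative sketch only.
    """
    if max_keep is None:
        return []  # no limit configured: keep everything
    ordered = sorted(step_dirs, key=lambda d: int(d.rsplit("_", 1)[1]))
    if max_keep <= 0:
        return ordered  # keep none
    return ordered[:-max_keep]  # drop all but the max_keep most recent
```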

Usage Examples

# Configuration (YAML)
# trainer:
#   total_epochs: 3
#   test_freq: 50
#   save_freq: 100
#   default_local_dir: /mnt/checkpoints/my_experiment
#   max_actor_ckpt_to_keep: 3
#   max_critic_ckpt_to_keep: 2

# The training loop calls these methods automatically:
from verl.trainer.ppo.ray_trainer import RayPPOTrainer

# During the training loop (simplified):
# for epoch in range(trainer.config.trainer.total_epochs):
#     for batch in trainer.train_dataloader:
#         trainer.global_steps += 1
#
#         # ... training step ...
#
#         # Periodic validation
#         if trainer.global_steps % trainer.config.trainer.test_freq == 0:
#             val_metrics = trainer._validate()
#             # val_metrics contains per-source reward means, stds, etc.
#
#         # Periodic checkpoint saving
#         if trainer.global_steps % trainer.config.trainer.save_freq == 0:
#             trainer._save_checkpoint()
#             # Saves to: /mnt/checkpoints/my_experiment/global_step_100/actor/
#             #           /mnt/checkpoints/my_experiment/global_step_100/critic/
#             #           /mnt/checkpoints/my_experiment/global_step_100/data.pt
#             #           /mnt/checkpoints/my_experiment/latest_checkpointed_iteration.txt
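To resume a run, the tracker file written at save time is read back to locate the most recent checkpoint directory. A minimal sketch of that lookup (the helper name is hypothetical, not part of verl):

```python
import os

def find_latest_checkpoint(local_dir):
    """Read latest_checkpointed_iteration.txt and return the most recent
    checkpoint directory, or None if no checkpoint exists. Illustrative only."""
    tracker = os.path.join(local_dir, "latest_checkpointed_iteration.txt")
    if not os.path.exists(tracker):
        return None  # fresh run: nothing to resume from
    with open(tracker) as f:
        step = int(f.read().strip())
    return os.path.join(local_dir, f"global_step_{step}")
```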

Related Pages

Implements Principle

Environment Requirements

Heuristics Used
