Implementation:Volcengine Verl Split Placement PPO Entry

Knowledge Sources	Volcengine_Verl
Domains	Reinforcement_Learning, Distributed_Training, Resource_Management
Last Updated	2026-02-07 18:00 GMT

Overview

A concrete entry point, provided by the verl framework, for launching PPO training with split resource pool placement, in which the actor and critic occupy separate GPU pools.

Description

The main_ppo_split.py module serves as the entry point for PPO training using a split placement strategy. Unlike the default colocated setup where actor and critic share the same GPU pool, this example separates them into distinct resource pools. The module contains:

A RewardManager class that dispatches reward computation to dataset-specific scoring functions (GSM8K or MATH)
A main_task Ray remote function that configures two separate resource pools (actor_rollout_ref_pool and critic_pool), assigns workers to each, and monkey-patches the RayPPOTrainer.fit method with a custom training loop from split_monkey_patch.py

Split placement is achieved by dividing the available resources in half, by node or by GPU, so that the actor and critic resource pools each receive one half.
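The halving rule described above can be sketched as a small function that builds a pool specification. This is an illustration under stated assumptions, not the module's code: the function name split_pool_spec is hypothetical, the choice to split GPUs on a single node and split nodes otherwise is an assumption, and the spec format (pool name mapped to a list of GPU counts, one entry per node) is assumed as well.

```python
from typing import Dict, List


def split_pool_spec(nnodes: int, n_gpus_per_node: int) -> Dict[str, List[int]]:
    """Divide resources in half between the actor and critic pools.

    Assumed rule: with one node, each pool gets half of that node's GPUs;
    with several nodes, each pool gets half of the nodes with all their GPUs.
    """
    if nnodes < 1 or n_gpus_per_node < 1:
        raise ValueError("nnodes and n_gpus_per_node must be positive")

    if nnodes == 1:
        if n_gpus_per_node < 2:
            raise ValueError("a single node needs at least 2 GPUs to split")
        per_pool = [n_gpus_per_node // 2]
    else:
        # With an odd node count, integer division leaves one node unused.
        per_pool = [n_gpus_per_node] * (nnodes // 2)

    return {
        "actor_rollout_ref_pool": list(per_pool),
        "critic_pool": list(per_pool),
    }
```

For example, one node with 8 GPUs yields 4 GPUs per pool, while four nodes with 8 GPUs each yield two full nodes per pool.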

Usage

Use this module when you need to run PPO training with the actor and critic on separate GPU pools to avoid memory contention. The script is a Hydra entry point, so it is launched from the command line with a Hydra config rather than imported.

Code Reference

Source Location

Repository: Volcengine_Verl
File: examples/split_placement/main_ppo_split.py
Lines: 1-201

Signature

class RewardManager:
    def __init__(self, tokenizer, num_examine: int) -> None:
        """
        Args:
            tokenizer: HuggingFace tokenizer for decoding sequences.
            num_examine: Number of decoded responses to print per data source.
        """

    def __call__(self, data: DataProto, return_dict: bool = False):
        """Compute reward scores for a batch of generated sequences.

        Args:
            data: DataProto containing prompts, responses, attention_mask,
                  and non_tensor_batch with reward_model ground truth.
            return_dict: If True, return {"reward_tensor": tensor}; else return tensor.

        Returns:
            torch.Tensor or dict: Reward scores placed at last valid response token.
        """

Import

# This module is not imported; it is run as a Hydra main entry point:
python examples/split_placement/main_ppo_split.py \
    --config-path=config --config-name=ppo_trainer_split

I/O Contract

Inputs

Name	Type	Required	Description
config	OmegaConf DictConfig	Yes	Hydra configuration with actor_rollout_ref, critic, trainer sections
tokenizer	PreTrainedTokenizer	Yes	HuggingFace tokenizer (loaded from model path)
data	DataProto	Yes	Batch containing prompts, responses, attention_mask, and reward ground truth

Outputs

Name	Type	Description
reward_tensor	torch.Tensor	Float tensor of shape [batch_size, response_length] holding each sample's reward at its last valid response token
checkpoints	files	Model checkpoints saved to configured output directory
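The "reward at last valid token" layout in the outputs table can be illustrated without torch. The sketch below uses plain Python lists in place of tensors and assumes right-padded responses (a mask of 1s followed by 0s) and zeros at every other position; the function name place_rewards is hypothetical and the code is not the module's implementation.

```python
from typing import List


def place_rewards(response_masks: List[List[int]], scores: List[float]) -> List[List[float]]:
    """Build a [batch_size, response_length] grid with one score per row.

    Each row is zero everywhere except at the index of the last valid
    response token, which holds that sample's scalar score.
    Assumes right padding: valid tokens come first in each mask.
    """
    rewards = []
    for mask, score in zip(response_masks, scores):
        row = [0.0] * len(mask)
        valid_len = sum(mask)  # number of non-padding response tokens
        if valid_len > 0:
            row[valid_len - 1] = score
        rewards.append(row)
    return rewards
```

Putting the scalar on the final valid token gives a sparse, sequence-level reward in a token-shaped layout, which is the shape the table describes.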

Usage Examples

Running Split Placement PPO

# Launch split placement PPO training via command line:
# python examples/split_placement/main_ppo_split.py \
#     --config-path=config --config-name=ppo_trainer_split

# The script automatically:
# 1. Initializes Ray cluster
# 2. Creates two separate GPU resource pools (actor vs critic)
# 3. Monkey-patches RayPPOTrainer.fit with parallel update logic
# 4. Runs the training loop
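Step 3 relies on ordinary Python monkey-patching: assigning a new function to a class attribute replaces the method for every instance. A toy sketch of the mechanism follows; ToyTrainer and fit_split are hypothetical stand-ins and do not reproduce RayPPOTrainer or the loop in split_monkey_patch.py.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer class with a default fit loop."""

    def fit(self) -> str:
        return "sequential actor/critic update"


def fit_split(self) -> str:
    """Hypothetical replacement loop; takes self like any bound method."""
    return "parallel actor/critic update"


# Replace the method on the class, before any training run calls fit().
ToyTrainer.fit = fit_split

trainer = ToyTrainer()
result = trainer.fit()  # now dispatches to fit_split
```

Because the patch is applied to the class rather than to one instance, it must run before the trainer's fit method is invoked, and it affects every trainer created in that process.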

Using RewardManager Directly

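# Run from the repository root so that the examples package is importable.
from transformers import AutoTokenizer

from examples.split_placement.main_ppo_split import RewardManager

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
reward_mgr = RewardManager(tokenizer=tokenizer, num_examine=2)

# data must already exist: a DataProto with prompts, responses, attention_mask,
# and the reward ground truth in non_tensor_batch
reward_tensor = reward_mgr(data)
# reward_tensor shape: [batch_size, response_length]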
