Implementation:Volcengine Verl Split Placement PPO Entry
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Distributed_Training, Resource_Management |
| Last Updated | 2026-02-07 18:00 GMT |
Overview
A concrete tool, provided by the verl framework, for launching PPO training with split resource pool placement, in which the actor and critic occupy separate GPU pools.
Description
The main_ppo_split.py module serves as the entry point for PPO training using a split placement strategy. Unlike the default colocated setup where actor and critic share the same GPU pool, this example separates them into distinct resource pools. The module contains:
- A RewardManager class that dispatches reward computation to dataset-specific scoring functions (GSM8K or MATH)
- A main_task Ray remote function that configures two separate resource pools (actor_rollout_ref_pool and critic_pool), assigns workers to each, and monkey-patches the RayPPOTrainer.fit method with a custom training loop from split_monkey_patch.py
Split placement is achieved by halving the available hardware between the actor and critic resource pools, either by node or by GPUs per node.
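The halving described above can be sketched as a small pure function. This is a minimal illustration, not the module's actual code: the function name `split_resource_pools` is hypothetical, and both the mapping format (pool name to a list of per-node GPU counts) and the single-node branching rule are assumptions. Only the two pool names come from the description.

```python
from typing import Dict, List


def split_resource_pools(nnodes: int, n_gpus_per_node: int) -> Dict[str, List[int]]:
    """Return a pool-name -> per-node GPU counts mapping that halves the hardware.

    Hypothetical helper; the branching rule below is an assumption.
    """
    if nnodes // 2 == 0 and n_gpus_per_node // 2 > 0:
        # Too few nodes to split by node: give each pool half of each node's GPUs.
        per_pool = [n_gpus_per_node // 2] * nnodes
    else:
        # Enough nodes to split by node: each pool gets half the nodes, all GPUs on each.
        per_pool = [n_gpus_per_node] * (nnodes // 2)
    return {
        "actor_rollout_ref_pool": list(per_pool),
        "critic_pool": list(per_pool),
    }


single_node = split_resource_pools(nnodes=1, n_gpus_per_node=8)
multi_node = split_resource_pools(nnodes=4, n_gpus_per_node=8)
```

Under these assumptions, one 8-GPU node yields two pools of 4 GPUs each, while four 8-GPU nodes yield two pools of two full nodes each.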
Usage
Use this module when you need to run PPO training with the actor and critic on separate GPU pools to avoid memory contention. It is run as a script and configured through Hydra.
Code Reference
Source Location
- Repository: Volcengine_Verl
- File: examples/split_placement/main_ppo_split.py
- Lines: 1-201
Signature
class RewardManager:
    def __init__(self, tokenizer, num_examine: int) -> None:
        """
        Args:
            tokenizer: HuggingFace tokenizer for decoding sequences.
            num_examine: Number of decoded responses to print per data source.
        """

    def __call__(self, data: DataProto, return_dict: bool = False):
        """Compute reward scores for a batch of generated sequences.

        Args:
            data: DataProto containing prompts, responses, attention_mask,
                and non_tensor_batch with reward_model ground truth.
            return_dict: If True, return {"reward_tensor": tensor}; else return tensor.

        Returns:
            torch.Tensor or dict: Reward scores placed at last valid response token.
        """
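The dataset-specific dispatch that RewardManager performs might look roughly like the sketch below. Everything here is illustrative: the two scoring functions are simplified stand-ins rather than verl's scorers, the helper `select_score_fn` is hypothetical, and the data-source keys `"openai/gsm8k"` and `"lighteval/MATH"` are assumed identifiers for the GSM8K and MATH datasets.

```python
from typing import Callable, Dict


def score_gsm8k(solution: str, ground_truth: str) -> float:
    """Stand-in scorer: 1.0 if the text after the last '####' equals the answer."""
    answer = solution.split("####")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


def score_math(solution: str, ground_truth: str) -> float:
    """Stand-in scorer: 1.0 if the solution contains the answer in boxed form."""
    boxed = "\\boxed{" + ground_truth.strip() + "}"
    return 1.0 if boxed in solution else 0.0


# Assumed data-source keys; the real identifiers may differ.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "openai/gsm8k": score_gsm8k,
    "lighteval/MATH": score_math,
}


def select_score_fn(data_source: str) -> Callable[[str, str], float]:
    """Look up the scorer for a data source, failing loudly on unknown sources."""
    try:
        return SCORERS[data_source]
    except KeyError:
        raise NotImplementedError(f"no scorer registered for {data_source!r}") from None
```

A table lookup keeps the dispatch in one place, so supporting another dataset only requires registering one more entry.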
Import
# This module is run as a Hydra main entry point:
python examples/split_placement/main_ppo_split.py \
--config-path=config --config-name=ppo_trainer_split
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | OmegaConf DictConfig | Yes | Hydra configuration with actor_rollout_ref, critic, trainer sections |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer (loaded from model path) |
| data | DataProto | Yes | Batch containing prompts, responses, attention_mask, and reward ground truth |
Outputs
| Name | Type | Description |
|---|---|---|
| reward_tensor | torch.Tensor | Float tensor of shape [batch_size, response_length] with reward at last valid token |
| checkpoints | files | Model checkpoints saved to configured output directory |
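To make the shape of reward_tensor concrete, the sketch below places one scalar score per sequence at the last valid response token and leaves every other position at zero. It uses plain Python lists in place of torch tensors so that it runs without dependencies. The function name `place_rewards` is hypothetical, and right-padded responses are assumed.

```python
from typing import List


def place_rewards(scores: List[float], response_masks: List[List[int]]) -> List[List[float]]:
    """Build a [batch_size, response_length] grid with each score at the last valid token.

    response_masks holds 1 for real response tokens and 0 for padding
    (right padding is assumed).
    """
    rewards = []
    for score, mask in zip(scores, response_masks):
        row = [0.0] * len(mask)
        valid_len = sum(mask)  # count of non-padding response tokens
        if valid_len > 0:
            row[valid_len - 1] = score  # last valid position receives the score
        rewards.append(row)
    return rewards


grid = place_rewards([1.0, 0.5], [[1, 1, 1, 0], [1, 1, 0, 0]])
# grid == [[0.0, 0.0, 1.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
```

Because the reward is sparse, each row sums to the sequence-level score.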
Usage Examples
Running Split Placement PPO
# Launch split placement PPO training via command line:
# python examples/split_placement/main_ppo_split.py \
# --config-path=config --config-name=ppo_trainer_split
# The script automatically:
# 1. Initializes Ray cluster
# 2. Creates two separate GPU resource pools (actor vs critic)
# 3. Monkey-patches RayPPOTrainer.fit with parallel update logic
# 4. Runs the training loop
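Step 3 relies on ordinary Python monkey-patching: assigning a function to a class attribute replaces that method for every instance. The sketch below shows the mechanism with a stand-in `Trainer` class. It is not RayPPOTrainer, and `split_fit` is a hypothetical placeholder for the custom loop defined in split_monkey_patch.py.

```python
class Trainer:
    """Stand-in trainer; not the real RayPPOTrainer."""

    def __init__(self, steps: int) -> None:
        self.steps = steps
        self.log = []

    def fit(self) -> str:
        self.log.append("default fit")
        return "default"


def split_fit(self) -> str:
    """Replacement training loop; receives the trainer instance as `self`."""
    for step in range(self.steps):
        self.log.append(f"split step {step}")
    return "split"


# Rebinding the class attribute swaps the method for all Trainer instances.
Trainer.fit = split_fit

trainer = Trainer(steps=2)
result = trainer.fit()
```

Patching the class rather than an instance means the replacement still takes effect if the trainer is constructed later, inside other code.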
Using RewardManager Directly
from transformers import AutoTokenizer

from examples.split_placement.main_ppo_split import RewardManager

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
reward_mgr = RewardManager(tokenizer=tokenizer, num_examine=2)

# `data` is a DataProto (constructed elsewhere) with prompts, responses,
# attention_mask, and reward ground truth
reward_tensor = reward_mgr(data)
# reward_tensor shape: [batch_size, response_length]