Implementation:Volcengine Verl Split Placement PPO Entry

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Distributed_Training, Resource_Management
Last Updated: 2026-02-07 18:00 GMT

Overview

A concrete tool, provided by the verl framework, for launching PPO training with split resource pool placement, in which the actor and critic occupy separate GPU pools.

Description

The main_ppo_split.py module serves as the entry point for PPO training using a split placement strategy. Unlike the default colocated setup where actor and critic share the same GPU pool, this example separates them into distinct resource pools. The module contains:

  • A RewardManager class that dispatches reward computation to dataset-specific scoring functions (GSM8K or MATH)
  • A main_task Ray remote function that configures two separate resource pools (actor_rollout_ref_pool and critic_pool), assigns workers to each, and monkey-patches the RayPPOTrainer.fit method with a custom training loop from split_monkey_patch.py

The split placement is achieved by dividing available nodes or GPUs in half between actor and critic resource pools.
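
The halving described above can be sketched as a small function. This is a minimal illustration, not the module's actual code: the helper name split_pool_spec, the branch condition, and the choice to list one GPU count per node are assumptions. Only the pool names actor_rollout_ref_pool and critic_pool come from this page.

```python
def split_pool_spec(nnodes: int, n_gpus_per_node: int) -> dict[str, list[int]]:
    """Return, for each pool, a list of GPU counts with one entry per node.

    Hypothetical helper: the real entry point may decide differently.
    """
    if nnodes // 2 == 0 and n_gpus_per_node // 2 > 0:
        # Single node: each pool gets half of that node's GPUs.
        per_pool = [n_gpus_per_node // 2] * nnodes
    else:
        # Several nodes: each pool gets half of the nodes, with all their GPUs.
        # With an odd node count, one node is left unused in this sketch.
        per_pool = [n_gpus_per_node] * (nnodes // 2)
    return {
        "actor_rollout_ref_pool": list(per_pool),
        "critic_pool": list(per_pool),
    }


if __name__ == "__main__":
    print(split_pool_spec(nnodes=1, n_gpus_per_node=8))
    print(split_pool_spec(nnodes=4, n_gpus_per_node=8))
```

Under these assumptions, one 8-GPU node yields two pools of 4 GPUs each, while four 8-GPU nodes yield two pools of two full nodes each.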

Usage

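Use this module when you need to run PPO training with the actor and critic on separate GPU pools, for example to avoid memory contention between them. The module is a Hydra entry point: run the script directly and select the configuration with Hydra's --config-path and --config-name options.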

Code Reference

Source Location

Signature

class RewardManager:
    def __init__(self, tokenizer, num_examine: int) -> None:
        """
        Args:
            tokenizer: HuggingFace tokenizer for decoding sequences.
            num_examine: Number of decoded responses to print per data source.
        """

    def __call__(self, data: DataProto, return_dict: bool = False):
        """Compute reward scores for a batch of generated sequences.

        Args:
            data: DataProto containing prompts, responses, attention_mask,
                  and non_tensor_batch with reward_model ground truth.
            return_dict: If True, return {"reward_tensor": tensor}; else return tensor.

        Returns:
            torch.Tensor or dict: Reward scores placed at last valid response token.
        """
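
The dispatch to dataset-specific scoring functions mentioned in the Description could look roughly like the sketch below. Everything here is illustrative: the two scoring functions are simplified stand-ins (exact string match), the data-source keys "openai/gsm8k" and "lighteval/MATH" are assumed labels, and select_score_fn is a hypothetical name rather than a confirmed part of the module.

```python
def score_gsm8k(solution: str, ground_truth: str) -> float:
    # Stand-in scorer: compare the text after the last "####" marker.
    answer = solution.split("####")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


def score_math(solution: str, ground_truth: str) -> float:
    # Stand-in scorer: compare the content of the last \boxed{...}.
    # Nested braces are not handled in this simplified version.
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return 0.0
    end = solution.find("}", start)
    if end == -1:
        return 0.0
    answer = solution[start + len(marker):end].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


# Assumed mapping from a sample's data source label to its scorer.
SCORERS = {
    "openai/gsm8k": score_gsm8k,
    "lighteval/MATH": score_math,
}


def select_score_fn(data_source: str):
    """Return the scorer for a data source, or fail loudly if unknown."""
    try:
        return SCORERS[data_source]
    except KeyError:
        raise NotImplementedError(f"no scorer for {data_source!r}") from None
```

A lookup table keeps the per-dataset logic out of the reward loop, and an unknown data source raises immediately instead of silently scoring zero.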

Import

# This module is run as a Hydra main entry point:
python examples/split_placement/main_ppo_split.py \
    --config-path=config --config-name=ppo_trainer_split

I/O Contract

Inputs

Name       Type                  Required  Description
config     OmegaConf DictConfig  Yes       Hydra configuration with actor_rollout_ref, critic, and trainer sections
tokenizer  PreTrainedTokenizer   Yes       HuggingFace tokenizer (loaded from model path)
data       DataProto             Yes       Batch containing prompts, responses, attention_mask, and reward ground truth

Outputs

Name           Type          Description
reward_tensor  torch.Tensor  Float tensor of shape [batch_size, response_length] with the reward at the last valid response token
checkpoints    files         Model checkpoints saved to the configured output directory
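
To make the reward_tensor contract concrete, the sketch below places one scalar score per sample at the last valid response token and leaves every other position at zero. It uses plain Python lists instead of torch tensors so that it runs without dependencies. The function name place_rewards and the assumption that responses are right-padded (valid tokens first, padding after) are illustrative, not taken from the module.

```python
def place_rewards(
    scores: list[float], response_mask: list[list[int]]
) -> list[list[float]]:
    """Build a [batch_size, response_length] reward grid.

    scores: one scalar score per sample.
    response_mask: 1 for valid response tokens, 0 for padding
                   (assumed right-padded).
    """
    rewards = []
    for score, mask in zip(scores, response_mask):
        row = [0.0] * len(mask)
        valid_len = sum(mask)  # number of non-padding response tokens
        if valid_len > 0:
            # Only the last valid token carries the reward.
            row[valid_len - 1] = score
        rewards.append(row)
    return rewards


if __name__ == "__main__":
    mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
    print(place_rewards([1.0, 0.5], mask))
```

Putting the whole score on a single token gives a sparse, sequence-level reward while keeping the per-token shape described in the Outputs table.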

Usage Examples

Running Split Placement PPO

# Launch split placement PPO training via command line:
# python examples/split_placement/main_ppo_split.py \
#     --config-path=config --config-name=ppo_trainer_split

# The script automatically:
# 1. Initializes Ray cluster
# 2. Creates two separate GPU resource pools (actor vs critic)
# 3. Monkey-patches RayPPOTrainer.fit with parallel update logic
# 4. Runs the training loop
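
Step 3 above relies on ordinary Python monkey-patching: assigning a new function to a class attribute replaces the method for every instance. The sketch below shows only that mechanism. Trainer is a hypothetical stand-in class, not RayPPOTrainer, and split_fit does not reproduce the parallel update logic from split_monkey_patch.py.

```python
class Trainer:
    """Hypothetical stand-in for a trainer with a default training loop."""

    def __init__(self, steps: int) -> None:
        self.steps = steps

    def fit(self) -> list[str]:
        return [f"colocated step {i}" for i in range(self.steps)]


def split_fit(self) -> list[str]:
    """Replacement loop; it receives self like any other method."""
    return [f"split step {i}" for i in range(self.steps)]


# Patch the class, not an instance, so all trainers use the new loop.
Trainer.fit = split_fit

if __name__ == "__main__":
    print(Trainer(steps=2).fit())
```

Patching the class before constructing the trainer means the rest of the setup code can call fit() unchanged.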

Using RewardManager Directly

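from transformers import AutoTokenizer

# Assumes the repository root is on the import path.
from examples.split_placement.main_ppo_split import RewardManager

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
reward_mgr = RewardManager(tokenizer=tokenizer, num_examine=2)

# `data` must be an existing DataProto with prompts, responses, attention_mask,
# and reward ground truth in non_tensor_batch; it is not constructed here.
reward_tensor = reward_mgr(data)
# reward_tensor shape: [batch_size, response_length]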
