
Implementation:Volcengine Verl Split Placement Fit

From Leeroopedia
Revision as of 17:07, 16 February 2026 by Admin (Auto-imported from implementations/Volcengine_Verl_Split_Placement_Fit.md)


Knowledge Sources
Domains: Reinforcement_Learning, Distributed_Training, Optimization
Last Updated: 2026-02-07 18:00 GMT

Overview

Concrete tool for running a modified PPO training loop with parallel actor and critic updates on separate resource pools, provided by the verl framework.

Description

The split_monkey_patch.py module provides a replacement fit() function for RayPPOTrainer that enables parallel execution of actor and critic update steps. In the standard verl training loop, actor and critic updates run sequentially. This implementation exploits the split resource placement (actor and critic on separate GPU pools) to overlap these two heavy compute phases.

Key differences from the standard fit() loop:

  • Actor update (update_actor) and critic update (update_critic) are issued concurrently using non-blocking RPC calls
  • The function then waits for both to complete using .get() calls
  • When actor and critic reside on separate resource pools, the two updates overlap, so the update phase takes roughly as long as the slower of the two rather than the sum of both
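
The dispatch pattern in the list above can be illustrated with a minimal sketch. It does not use verl or Ray: the functions `update_actor` and `update_critic` below are hypothetical stand-ins that sleep to simulate work, and a thread pool plays the role of the non-blocking RPC layer. Calling `.result()` on a future here corresponds to the `.get()` calls mentioned above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the actor and critic worker groups. In verl these
# would be remote worker groups on separate resource pools; here they only sleep.
def update_actor(batch, seconds=0.2):
    time.sleep(seconds)  # simulate a heavy update on the actor pool
    return {"actor/loss": 0.1 * len(batch)}

def update_critic(batch, seconds=0.2):
    time.sleep(seconds)  # simulate a heavy update on the critic pool
    return {"critic/loss": 0.05 * len(batch)}

def sequential_step(batch):
    """Baseline: the critic update finishes before the actor update starts."""
    metrics = {}
    metrics.update(update_critic(batch))
    metrics.update(update_actor(batch))
    return metrics

def parallel_step(batch, pool):
    """Issue both updates without blocking, then wait for both results."""
    critic_future = pool.submit(update_critic, batch)
    actor_future = pool.submit(update_actor, batch)
    metrics = {}
    metrics.update(critic_future.result())  # blocks until the critic is done
    metrics.update(actor_future.result())   # blocks until the actor is done
    return metrics

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start
```

Both steps return the same metrics; only the wall-clock time differs. The overlap only pays off when the two updates do not compete for the same devices, which is the situation split placement creates.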

The function is designed to be monkey-patched onto RayPPOTrainer via RayPPOTrainer.fit = fit.
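
As a rough illustration of how this kind of class-level patch behaves in Python, the sketch below uses a hypothetical `Trainer` class in place of RayPPOTrainer. The class, its constructor, and the returned strings are assumptions made for the example and are not part of verl.

```python
class Trainer:
    """Hypothetical stand-in for RayPPOTrainer."""

    def __init__(self, name):
        self.name = name

    def fit(self):
        return f"{self.name}: sequential fit"


def fit(self):
    """Replacement loop; receives the instance as `self` once attached to the class."""
    return f"{self.name}: parallel fit"


existing = Trainer("before-patch")  # created before the patch is applied
Trainer.fit = fit                   # class-level assignment, as in RayPPOTrainer.fit = fit
later = Trainer("after-patch")      # created after the patch is applied
```

Because the assignment targets the class rather than an instance, every instance in the process picks up the replacement, including ones created before the patch. The patch therefore needs to be in place before `fit()` is called, not necessarily before the trainer is constructed.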

Usage

Use this modified fit function when running PPO with split placement, that is, with actor and critic on separate GPU resource pools. main_ppo_split.py applies the patch automatically; the module is not meant to be imported on its own outside that entry point.

Code Reference

Source Location

Signature

def fit(self):
    """
    The training loop of PPO with parallel actor/critic updates.

    The driver process calls compute functions of the worker group through RPC
    to construct the PPO dataflow. Light-weight advantage computation is done
    on the driver process.

    Key difference from standard fit(): actor and critic updates run in parallel
    using non-blocking RPC calls when split placement is active.
    """

Import

from split_monkey_patch import fit
from verl.trainer.ppo.ray_trainer import RayPPOTrainer

# Applied via monkey-patching:
RayPPOTrainer.fit = fit

I/O Contract

Inputs

Name | Type | Required | Description
self | RayPPOTrainer | Yes | The trainer instance (monkey-patched method)
self.config | OmegaConf | Yes | Full training configuration
self.actor_rollout_wg | WorkerGroup | Yes | Actor rollout worker group
self.critic_wg | WorkerGroup | Yes (if use_critic) | Critic worker group on separate pool
self.train_dataloader | DataLoader | Yes | Training data iterator

Outputs

Name | Type | Description
metrics | dict | Training metrics logged each step (actor loss, critic loss, rewards, timing)
checkpoints | files | Saved at configured save_freq intervals
val_metrics | dict | Validation metrics at configured test_freq intervals
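
The interval-based outputs can be pictured with a small sketch of step gating. This is an assumption about how such frequencies are typically applied, not a copy of the verl loop: the helpers `is_due` and `plan` are hypothetical, and treating a non-positive frequency as "disabled" is a convention chosen for the example.

```python
def is_due(step: int, freq: int) -> bool:
    """Return True when a periodic action should fire at this step.

    Assumption: a non-positive freq disables the action entirely.
    """
    return freq > 0 and step % freq == 0


def plan(total_steps: int, save_freq: int, test_freq: int):
    """List (step, action) pairs for a run of total_steps steps, counting from 1."""
    events = []
    for step in range(1, total_steps + 1):
        if is_due(step, test_freq):
            events.append((step, "validate"))
        if is_due(step, save_freq):
            events.append((step, "save"))
    return events
```

With `save_freq=3` and `test_freq=2`, a six-step run validates at steps 2, 4, and 6 and saves at steps 3 and 6.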

Usage Examples

Applying the Monkey Patch

from split_monkey_patch import fit
from verl.trainer.ppo.ray_trainer import RayPPOTrainer

# Monkey-patch the fit method to enable parallel updates
RayPPOTrainer.fit = fit

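# Create trainer with split resource pools.
# config, tokenizer, role_worker_mapping, resource_pool_manager,
# ray_worker_group_cls, reward_fn, and val_reward_fn are assumed to be defined earlier.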
trainer = RayPPOTrainer(
    config=config,
    tokenizer=tokenizer,
    role_worker_mapping=role_worker_mapping,
    resource_pool_manager=resource_pool_manager,
    ray_worker_group_cls=ray_worker_group_cls,
    reward_fn=reward_fn,
    val_reward_fn=val_reward_fn,
)
trainer.init_workers()
trainer.fit()  # Uses the parallel actor/critic update loop
