Implementation:OpenRLHF Broadcast to vLLM

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Training_Infrastructure
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete mechanism provided by OpenRLHF for synchronizing policy weights from DeepSpeed training workers to vLLM inference engines.

Description

The weight broadcast mechanism gathers the full model state dict from DeepSpeed ZeRO-sharded training workers, then loads it into each vLLM engine's model via Ray remote calls. For LoRA models, only the adapter weights are transferred. The sync happens via NCCL or Ray object store depending on configuration.
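The gather-then-broadcast pattern described above can be sketched in miniature. This is an illustrative stand-in, not the OpenRLHF implementation: plain Python lists and dicts take the place of DeepSpeed ZeRO-3 parameter shards, and a mock class takes the place of a Ray-managed vLLM engine. All names here (`MockVLLMEngine`, `gather_full_state_dict`, `broadcast_to_engines`) are invented for this sketch.

```python
# Minimal sketch of the gather-then-broadcast pattern. Plain dicts and a
# mock engine stand in for DeepSpeed ZeRO-3 shards and Ray-managed vLLM
# engines; all names are illustrative, not the OpenRLHF API.

class MockVLLMEngine:
    """Stands in for a Ray actor wrapping a vLLM engine."""
    def __init__(self):
        self.weights = {}

    def load_weights(self, state_dict):
        # In the real system this call happens via a Ray remote call,
        # and the engine loads the tensors into its model.
        self.weights.update(state_dict)

def gather_full_state_dict(shards):
    # ZeRO-3 partitions each parameter across workers; gathering
    # reassembles the full tensor (here: concatenating list slices).
    full = {}
    for name in shards[0]:
        full[name] = [x for shard in shards for x in shard[name]]
    return full

def broadcast_to_engines(shards, engines):
    state_dict = gather_full_state_dict(shards)
    for engine in engines:  # one remote call per engine
        engine.load_weights(state_dict)
    return state_dict

# Two workers each hold half of every parameter.
shards = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
engines = [MockVLLMEngine(), MockVLLMEngine()]
full = broadcast_to_engines(shards, engines)
```

After the broadcast, every engine holds the same reassembled parameters, which is the invariant the real synchronization step must guarantee before the next generation round.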

Usage

Called by PPOTrainer.fit() after each PPO training epoch, before the next generation round.

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/trainer/ray/ppo_actor.py (broadcast method)

Signature

def broadcast_to_vllm(self) -> None:
    """
    Broadcast updated policy weights to vLLM engines.

    Steps:
    1. Gather full state dict from DeepSpeed ZeRO-3
    2. For each vLLM engine:
       - Send updated weights via Ray
       - Engine loads weights into its model
    3. Synchronize to ensure all engines are updated

    Side Effects:
        - vLLM engines' models updated with latest policy weights
    """

Import

# Called internally by PPOTrainer, not directly imported
# Located in: openrlhf/trainer/ray/ppo_actor.py

I/O Contract

Inputs

Name Type Required Description
(self) ActorPPOTrainer Yes Actor trainer with access to model and vLLM refs

Outputs

Name Type Description
(side effect) None vLLM engines updated with latest policy weights

Usage Examples

# Called within the PPO training loop (simplified; helper names are placeholders)
for episode in range(num_episodes):
    # 1. Generate samples with vLLM
    samples = vllm_generate(prompts)

    # 2. Score and compute advantages
    rewards = reward_model(samples)
    advantages = compute_gae(rewards, values)
    experience = make_experience(samples, advantages)

    # 3. PPO training update
    actor_trainer.ppo_train(experience)

    # 4. Sync weights to vLLM
    actor_trainer.broadcast_to_vllm()

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
