Implementation:OpenRLHF OpenRLHF Broadcast to vllm
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Training_Infrastructure |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by OpenRLHF, for synchronizing policy weights from DeepSpeed training workers to vLLM inference engines.
Description
The weight broadcast mechanism gathers the full model state dict from the DeepSpeed ZeRO-sharded training workers, then loads it into each vLLM engine's model via Ray remote calls. For LoRA models, only the adapter weights are transferred. The transfer uses either NCCL or the Ray object store, depending on configuration.
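The choice between a full sync and an adapter-only sync can be sketched as a filter over the state dict. This is a minimal illustration, not OpenRLHF's code: the function name `select_weights_to_sync` is hypothetical, and it assumes adapter parameters can be recognized by the `lora_` substring used in PEFT-style parameter names (for example `lora_A` and `lora_B`).

```python
from typing import Any, Dict


def select_weights_to_sync(state_dict: Dict[str, Any], lora_only: bool) -> Dict[str, Any]:
    """Return the subset of a state dict that should be sent to the engines.

    Hypothetical helper: assumes LoRA adapter parameters carry "lora_" in
    their names, which is a naming convention, not a guarantee.
    """
    if not lora_only:
        # Full sync: every parameter is transferred.
        return dict(state_dict)
    # Adapter-only sync: keep just the LoRA matrices, skip the frozen base weights.
    return {name: value for name, value in state_dict.items() if "lora_" in name}
```

With a LoRA model the adapter-only payload is typically much smaller than the full state dict, which is why it is worth filtering before any transfer takes place.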
Usage
Called by PPOTrainer.fit() after each PPO training epoch, before the next generation round.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/trainer/ray/ppo_actor.py (broadcast_to_vllm method)
Signature
def broadcast_to_vllm(self) -> None:
    """
    Broadcast updated policy weights to vLLM engines.

    Steps:
        1. Gather full state dict from DeepSpeed ZeRO-3
        2. For each vLLM engine:
            - Send updated weights via Ray
            - Engine loads weights into its model
        3. Synchronize to ensure all engines are updated

    Side Effects:
        - vLLM engines' models updated with latest policy weights
    """
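The three steps listed in the docstring can be walked through with a toy, dependency-free simulation. Everything here is a stand-in under stated assumptions: `ToyEngine`, `gather_full_state_dict`, and `broadcast_to_engines` are hypothetical names, plain Python lists replace tensors, and an in-process method call replaces the Ray remote call. Real ZeRO-3 gathering also involves collectives and padding that this sketch ignores.

```python
from typing import Dict, List

Shard = Dict[str, List[float]]


class ToyEngine:
    """Stand-in for a vLLM engine: it just stores the weights it receives."""

    def __init__(self) -> None:
        self.weights: Dict[str, List[float]] = {}

    def load_weights(self, state_dict: Dict[str, List[float]]) -> bool:
        # Copy so each engine owns its weights, as separate processes would.
        self.weights = {name: list(values) for name, values in state_dict.items()}
        return True


def gather_full_state_dict(shards: List[Shard]) -> Dict[str, List[float]]:
    """Step 1: rebuild each parameter by concatenating per-rank slices in rank order."""
    full: Dict[str, List[float]] = {}
    for shard in shards:
        for name, piece in shard.items():
            full.setdefault(name, []).extend(piece)
    return full


def broadcast_to_engines(shards: List[Shard], engines: List[ToyEngine]) -> None:
    full = gather_full_state_dict(shards)
    # Step 2: send the gathered weights to every engine, which loads them.
    acks = [engine.load_weights(full) for engine in engines]
    # Step 3: do not return until every engine has confirmed the update.
    if not all(acks):
        raise RuntimeError("at least one engine failed to load the new weights")
```

The final check stands in for the synchronization step: generation should not resume until every engine holds the same policy weights, otherwise samples would come from a mix of old and new policies.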
Import
# Called internally by PPOTrainer, not directly imported
# Located in: openrlhf/trainer/ray/ppo_actor.py
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (self) | ActorPPOTrainer | Yes | Actor trainer with access to model and vLLM refs |
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | vLLM engines updated with latest policy weights |
Usage Examples
# Called within PPO training loop (simplified)
for episode in range(num_episodes):
    # 1. Generate samples with vLLM
    samples = vllm_generate(prompts)

    # 2. Score and compute advantages
    rewards = reward_model(samples)
    advantages = compute_gae(rewards, values)

    # 3. PPO training update
    actor_trainer.ppo_train(experience)

    # 4. Sync weights to vLLM
    actor_trainer.broadcast_to_vllm()
Related Pages
Implements Principle
Requires Environment
Uses Heuristic