Workflow: OpenRLHF PPO Ray Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, PPO, Distributed_Training |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for training a language model policy using Proximal Policy Optimization (PPO) with Ray distributed orchestration, vLLM inference acceleration, and DeepSpeed training optimization.
Description
This workflow implements the core RL stage of the RLHF pipeline. It orchestrates four distributed model groups (Actor, Critic, Reference, Reward) across multiple GPU nodes using Ray. The Actor generates responses via vLLM engines, the Reward model scores them, the Critic estimates values for advantage computation, and the Reference model provides KL divergence regularization. PPO training updates the Actor and Critic using clipped surrogate objectives with GAE advantages. The hybrid engine mode enables GPU memory sharing between training and inference phases for maximum hardware utilization.
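At a high level, each training episode alternates between rollout generation and policy optimization. The following is a schematic Python rendering of that loop, not OpenRLHF's actual API: every helper object (`prompt_loader`, `experience_maker`, `weight_syncer`, and so on) is a hypothetical stand-in.

```python
# Schematic outer loop of the workflow (hypothetical helpers, not OpenRLHF's API).
for episode in range(num_episodes):
    prompts = prompt_loader.next_batch()                # Step 4: sample prompts
    sequences = vllm_engines.generate(prompts)          # Step 4: Actor rollouts via vLLM
    experiences = experience_maker.make(                # Step 4: score with all four models
        sequences,
        actor=actor_group, ref=ref_group,
        reward=reward_group, critic=critic_group,
    )
    replay_buffer.extend(experiences)
    for _ in range(ppo_epochs):                         # Step 5: PPO updates
        trainer.train_step(replay_buffer)
    weight_syncer.broadcast(actor_group, vllm_engines)  # Step 6: refresh vLLM weights
    replay_buffer.clear()
```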
Usage
Execute this workflow when you have a trained SFT model and reward model, and want to optimize the policy through reinforcement learning. This is the third and final stage in the canonical RLHF pipeline. It requires multi-GPU infrastructure (typically 3-4 nodes with 8 GPUs each). Supports PPO, REINFORCE++, GRPO, RLOO, and Dr. GRPO advantage estimators. Can use either a pretrained reward model or custom reward functions.
Execution Steps
Step 1: Initialize Ray cluster and placement groups
Start the Ray runtime and configure resource placement groups for model colocation. Define how many nodes and GPUs each model group (Actor, Critic, Reference, Reward) receives. Set up placement constraints for models that share GPUs (e.g., Actor+Reference colocated, Critic+Reward colocated).
Key considerations:
- Placement groups prevent resource contention between model groups
- Colocation modes allow multiple models to share GPU memory via sleep/wake scheduling
- The hybrid engine enables all four models to share the same GPU set
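A minimal placement-group sketch using Ray's public API (Ray >= 2.x); the single 8-GPU group and bundle sizes here are illustrative choices, not OpenRLHF defaults:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One bundle per GPU; PACK places bundles on as few nodes as possible.
pg = placement_group([{"GPU": 1, "CPU": 4}] * 8, strategy="PACK")
ray.get(pg.ready())  # block until the resources are actually reserved

@ray.remote(num_gpus=1)
class ModelWorker:  # hypothetical worker, stands in for one Actor/Critic/etc. rank
    def node_id(self) -> str:
        return ray.get_runtime_context().get_node_id()

# Pinning two model groups to the same bundles is what enables
# colocation (e.g., Actor+Reference sharing GPUs via sleep/wake).
workers = [
    ModelWorker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(8)
]
print(ray.get([w.node_id.remote() for w in workers]))
```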
Step 2: Create vLLM inference engines
Initialize vLLM engines for fast response generation during rollouts. Configure tensor parallelism, prefix caching, and dynamic batching. These engines run the Actor model weights for generation but are separate from the training copy.
Key considerations:
- vLLM provides 3-10x faster generation than Hugging Face Transformers' generate()
- Tensor parallelism size determines how many GPUs each engine spans
- The sleep mode allows vLLM to release GPU memory during training phases
- Weight sync transfers updated Actor weights to vLLM after each training step
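A minimal engine-creation sketch using vLLM's public `LLM` API; the model path and numbers are placeholders, and sleep mode requires `enable_sleep_mode=True` on a recent vLLM (roughly >= 0.7; verify against your installed version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/sft-actor",      # placeholder checkpoint path
    tensor_parallel_size=2,         # this engine spans 2 GPUs
    enable_prefix_caching=True,     # reuse KV cache across shared prompt prefixes
    gpu_memory_utilization=0.85,
    enable_sleep_mode=True,         # allow memory release during training phases
)

params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)
outputs = llm.generate(["Explain PPO in one sentence."], params)
print(outputs[0].outputs[0].text)

llm.sleep(level=1)   # release GPU memory while DeepSpeed trains
llm.wake_up()        # restore it before the next rollout phase
```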
Step 3: Initialize distributed model groups
Create Ray actor groups for each model role. Each group initializes its model using DeepSpeed strategy, loads pretrained weights, and sets up the training or inference pipeline. The four groups are:
Actor model group: Loads the SFT policy model for training with DeepSpeed ZeRO-3.
Reference model group: Loads a frozen copy of the initial SFT model for computing KL divergence.
Reward model group: Loads the trained reward model for scoring generated responses. Alternatively, a custom reward function script can be specified.
Critic model group: Loads a value model (often initialized from the reward model) for estimating state values and computing GAE advantages.
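As a condensed sketch of the pattern, the class below shows how a single trainable Actor rank might wrap a DeepSpeed-initialized model inside a Ray actor; `PolicyWorker` and its config handling are illustrative, not OpenRLHF's internal classes:

```python
import ray
import torch
import deepspeed
from transformers import AutoModelForCausalLM

@ray.remote(num_gpus=1)
class PolicyWorker:
    """Illustrative trainable Actor rank (not OpenRLHF's internal class)."""

    def __init__(self, pretrain_path: str, ds_config: dict):
        model = AutoModelForCausalLM.from_pretrained(
            pretrain_path, torch_dtype=torch.bfloat16
        )
        # ZeRO-3 partitions parameters, gradients, and optimizer state
        # across all ranks in the group.
        self.engine, *_ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=ds_config,
        )

    def train_step(self, batch: dict) -> float:
        outputs = self.engine(**batch)   # forward pass
        loss = outputs.loss              # placeholder objective; PPO swaps in
                                         # the clipped surrogate loss (Step 5)
        self.engine.backward(loss)
        self.engine.step()
        return loss.item()
```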
Step 4: Generate rollout experiences
For each training iteration, sample a batch of prompts and generate responses using the vLLM engines. The experience maker then processes the generated sequences through all four models: it computes log-probabilities from the Actor and Reference models, reward scores from the Reward model, and value estimates from the Critic, then assembles complete experience tuples for PPO training.
Key considerations:
- Rollout batch size determines how many prompts are sampled per iteration
- Multiple samples per prompt (n_samples_per_prompt) enable group-based advantages (GRPO)
- KL divergence between Actor and Reference prevents policy collapse
- Dynamic filtering can remove low-quality samples before training
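The sketch below shows one common way to structure an experience record and fold the KL penalty into per-token rewards; the `Experience` schema is illustrative, and the simple log-ratio KL estimate is used for brevity (OpenRLHF supports several estimators):

```python
import torch
from dataclasses import dataclass

@dataclass
class Experience:
    """Per-sequence rollout record (illustrative schema)."""
    sequences: torch.Tensor         # prompt + response token ids
    action_log_probs: torch.Tensor  # per-token log-probs from the Actor
    ref_log_probs: torch.Tensor     # per-token log-probs from the Reference
    values: torch.Tensor            # per-token value estimates from the Critic
    reward: torch.Tensor            # scalar score from the Reward model

def kl_shaped_rewards(exp: Experience, kl_coef: float = 0.01) -> torch.Tensor:
    """Fold the KL penalty into per-token rewards.

    Every token pays for drifting from the Reference policy; the
    sequence-level reward-model score is added at the final token.
    """
    kl = exp.action_log_probs - exp.ref_log_probs  # simple per-token KL estimate
    rewards = -kl_coef * kl
    rewards[-1] = rewards[-1] + exp.reward         # terminal reward-model score
    return rewards
```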
Step 5: Compute advantages and train policy
Using the collected experiences, compute GAE (Generalized Advantage Estimation) advantages from rewards and value estimates. Then perform multiple PPO training epochs on the experience buffer, updating both the Actor (policy) and Critic (value) networks. The Actor uses the clipped surrogate PPO objective, and the Critic minimizes value prediction error.
Key considerations:
- Multiple PPO epochs per rollout batch improve sample efficiency
- Clipped surrogate objective prevents destructively large policy updates
- Importance sampling corrections (e.g., IcePop) compensate for off-policy staleness in the experience buffer
- Reward normalization stabilizes training across different reward scales
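The two core computations can be sketched in a few lines of PyTorch; `gamma=1.0` and `lam=0.95` are typical RLHF defaults, not values mandated by this workflow:

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response's per-token rewards."""
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # regression targets for the Critic
    return advantages, returns

def ppo_actor_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated for gradient descent."""
    ratio = (log_probs - old_log_probs).exp()  # importance sampling ratio
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()
```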
Step 6: Synchronize weights to vLLM
After training updates, broadcast the updated Actor weights from the DeepSpeed training copy to the vLLM inference engines. This uses NCCL collective operations or CUDA IPC for fast weight transfer.
Key considerations:
- Weight sync must complete before the next rollout generation begins
- NCCL broadcast is used for multi-node setups
- CUDA IPC is faster for single-node configurations
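A sketch of the training-side half of the sync, assuming a NCCL process group (`model_update_group`) created at startup that spans training rank 0 and every vLLM engine worker, with each vLLM worker posting a matching broadcast to receive and load the tensor:

```python
import deepspeed
import torch.distributed as dist

def broadcast_actor_weights(actor_model, model_update_group):
    """Push updated Actor weights to the vLLM engines (illustrative sketch)."""
    for name, param in actor_model.named_parameters():
        # Under ZeRO-3 each parameter is sharded; gather the full tensor first.
        with deepspeed.zero.GatheredParameters([param]):
            if dist.get_rank() == 0:
                # The vLLM-side workers must call a matching broadcast for
                # this parameter and copy it into their own weight copy.
                dist.broadcast(param.data, src=0, group=model_update_group)
```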
Step 7: Save trained models
After all training episodes complete, save the final Actor model (the aligned policy) and optionally the Critic model. The Actor model is the primary output used for deployment or further refinement.
Key considerations:
- Save both Actor (policy) and Critic (value) if continuing training later
- The trained Actor can be used directly for inference or as input to iterative refinement
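A saving sketch assuming DeepSpeed ZeRO-3 with `stage3_gather_16bit_weights_on_model_save` enabled in the DS config, so the sharded weights can be consolidated into a single Hugging Face-loadable checkpoint:

```python
import os

def save_actor(engine, tokenizer, output_dir: str):
    """Consolidate and save the trained Actor (illustrative sketch)."""
    os.makedirs(output_dir, exist_ok=True)
    # Gathers the ZeRO-3 shards and writes a single bf16/fp16 state dict.
    engine.save_16bit_model(output_dir, "pytorch_model.bin")
    # Save config + tokenizer so the directory loads with from_pretrained().
    engine.module.config.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```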