Workflow: OpenRLHF PPO Ray Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, PPO, Distributed_Training |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for training a language model policy using Proximal Policy Optimization (PPO) with Ray distributed orchestration, vLLM inference acceleration, and DeepSpeed training optimization.
Description
This workflow implements the core RL stage of the RLHF pipeline. It orchestrates four distributed model groups (Actor, Critic, Reference, Reward) across multiple GPU nodes using Ray. The Actor generates responses via vLLM engines, the Reward model scores them, the Critic estimates values for advantage computation, and the Reference model provides KL divergence regularization. PPO training updates the Actor and Critic using clipped surrogate objectives with GAE advantages. The hybrid engine mode enables GPU memory sharing between training and inference phases for maximum hardware utilization.
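At a high level, each training episode alternates between rollout generation and policy optimization. The following is a schematic Python rendering of that loop, not OpenRLHF's actual API: every helper object (`prompt_loader`, `experience_maker`, `weight_syncer`, and so on) is a hypothetical stand-in.

```python
# Schematic outer loop of the workflow (hypothetical helpers, not OpenRLHF's API).
for episode in range(num_episodes):
    prompts = prompt_loader.next_batch()                # Step 4: sample prompts
    sequences = vllm_engines.generate(prompts)          # Step 4: Actor rollouts via vLLM
    experiences = experience_maker.make(                # Step 4: score with all four models
        sequences,
        actor=actor_group, ref=ref_group,
        reward=reward_group, critic=critic_group,
    )
    replay_buffer.extend(experiences)
    for _ in range(ppo_epochs):                         # Step 5: PPO updates
        trainer.train_step(replay_buffer)
    weight_syncer.broadcast(actor_group, vllm_engines)  # Step 6: refresh vLLM weights
    replay_buffer.clear()
```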
Usage
Execute this workflow when you have a trained SFT model and reward model, and want to optimize the policy through reinforcement learning. This is the third and final stage in the canonical RLHF pipeline. It requires multi-GPU infrastructure (typically 3-4 nodes with 8 GPUs each). Supports PPO, REINFORCE++, GRPO, RLOO, and Dr. GRPO advantage estimators. Can use either a pretrained reward model or custom reward functions.
Execution Steps
Step 1: Initialize Ray cluster and placement groups
Start the Ray runtime and configure resource placement groups for model colocation. Define how many nodes and GPUs each model group (Actor, Critic, Reference, Reward) receives. Set up placement constraints for models that share GPUs (e.g., Actor+Reference colocated, Critic+Reward colocated).
Key considerations:
- Placement groups prevent resource contention between model groups
- Colocation modes allow multiple models to share GPU memory via sleep/wake scheduling
- The hybrid engine enables all four models to share the same GPU set
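A minimal placement-group sketch using Ray's public API (Ray >= 2.x); the single 8-GPU group and bundle sizes here are illustrative choices, not OpenRLHF defaults:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One bundle per GPU; PACK places bundles on as few nodes as possible.
pg = placement_group([{"GPU": 1, "CPU": 4}] * 8, strategy="PACK")
ray.get(pg.ready())  # block until the resources are actually reserved

@ray.remote(num_gpus=1)
class ModelWorker:  # hypothetical worker, stands in for one Actor/Critic/etc. rank
    def node_id(self) -> str:
        return ray.get_runtime_context().get_node_id()

# Pinning two model groups to the same bundles is what enables
# colocation (e.g., Actor+Reference sharing GPUs via sleep/wake).
workers = [
    ModelWorker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(8)
]
print(ray.get([w.node_id.remote() for w in workers]))
```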
Step 2: Create vLLM inference engines
Initialize vLLM engines for fast response generation during rollouts. Configure tensor parallelism, prefix caching, and dynamic batching. These engines run the Actor model weights for generation but are separate from the training copy.
Key considerations:
- vLLM provides 3-10x faster generation than Hugging Face Transformers' generate()
- Tensor parallelism size determines how many GPUs each engine spans
- The sleep mode allows vLLM to release GPU memory during training phases
- Weight sync transfers updated Actor weights to vLLM after each training step
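A minimal engine-creation sketch using vLLM's public `LLM` API; the model path and numbers are placeholders, and sleep mode requires `enable_sleep_mode=True` on a recent vLLM (roughly >= 0.7; verify against your installed version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/sft-actor",      # placeholder checkpoint path
    tensor_parallel_size=2,         # this engine spans 2 GPUs
    enable_prefix_caching=True,     # reuse KV cache across shared prompt prefixes
    gpu_memory_utilization=0.85,
    enable_sleep_mode=True,         # allow memory release during training phases
)

params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)
outputs = llm.generate(["Explain PPO in one sentence."], params)
print(outputs[0].outputs[0].text)

llm.sleep(level=1)   # release GPU memory while DeepSpeed trains
llm.wake_up()        # restore it before the next rollout phase
```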
Step 3: Initialize distributed model groups
Create Ray actor groups for each model role. Each group initializes its model using DeepSpeed strategy, loads pretrained weights, and sets up the training or inference pipeline. The four groups are:
Actor model group: Loads the SFT policy model for training with DeepSpeed ZeRO-3.
Reference model group: Loads a frozen copy of the initial SFT model for computing KL divergence.
Reward model group: Loads the trained reward model for scoring generated responses. Alternatively, a custom reward function script can be specified.
Critic model group: Loads a value model (often initialized from the reward model) for estimating state values and computing GAE advantages.
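As a condensed sketch of the pattern, the class below shows how a single trainable Actor rank might wrap a DeepSpeed-initialized model inside a Ray actor; `PolicyWorker` and its config handling are illustrative, not OpenRLHF's internal classes:

```python
import ray
import torch
import deepspeed
from transformers import AutoModelForCausalLM

@ray.remote(num_gpus=1)
class PolicyWorker:
    """Illustrative trainable Actor rank (not OpenRLHF's internal class)."""

    def __init__(self, pretrain_path: str, ds_config: dict):
        model = AutoModelForCausalLM.from_pretrained(
            pretrain_path, torch_dtype=torch.bfloat16
        )
        # ZeRO-3 partitions parameters, gradients, and optimizer state
        # across all ranks in the group.
        self.engine, *_ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=ds_config,
        )

    def train_step(self, batch: dict) -> float:
        outputs = self.engine(**batch)   # forward pass
        loss = outputs.loss              # placeholder objective; PPO swaps in
                                         # the clipped surrogate loss (Step 5)
        self.engine.backward(loss)
        self.engine.step()
        return loss.item()
```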
Step 4: Generate rollout experiences
For each training iteration, sample a batch of prompts and generate responses using the vLLM engines. The experience maker then processes the generated sequences through all four models: it computes log-probabilities from the Actor and Reference models, reward scores from the Reward model, and value estimates from the Critic, then assembles complete experience tuples for PPO training.
Key considerations:
- Rollout batch size determines how many prompts are sampled per iteration
- Multiple samples per prompt (n_samples_per_prompt) enable group-based advantages (GRPO)
- KL divergence between Actor and Reference prevents policy collapse
- Dynamic filtering can remove low-quality samples before training
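The sketch below shows one common way to structure an experience record and fold the KL penalty into per-token rewards; the `Experience` schema is illustrative, and the simple log-ratio KL estimate is used for brevity (OpenRLHF supports several estimators):

```python
import torch
from dataclasses import dataclass

@dataclass
class Experience:
    """Per-sequence rollout record (illustrative schema)."""
    sequences: torch.Tensor         # prompt + response token ids
    action_log_probs: torch.Tensor  # per-token log-probs from the Actor
    ref_log_probs: torch.Tensor     # per-token log-probs from the Reference
    values: torch.Tensor            # per-token value estimates from the Critic
    reward: torch.Tensor            # scalar score from the Reward model

def kl_shaped_rewards(exp: Experience, kl_coef: float = 0.01) -> torch.Tensor:
    """Fold the KL penalty into per-token rewards.

    Every token pays for drifting from the Reference policy; the
    sequence-level reward-model score is added at the final token.
    """
    kl = exp.action_log_probs - exp.ref_log_probs  # simple per-token KL estimate
    rewards = -kl_coef * kl
    rewards[-1] = rewards[-1] + exp.reward         # terminal reward-model score
    return rewards
```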
Step 5: Compute advantages and train policy
Using the collected experiences, compute GAE (Generalized Advantage Estimation) advantages from rewards and value estimates. Then perform multiple PPO training epochs on the experience buffer, updating both the Actor (policy) and Critic (value) networks. The Actor uses the clipped surrogate PPO objective, and the Critic minimizes value prediction error.
Key considerations:
- Multiple PPO epochs per rollout batch improve sample efficiency
- Clipped surrogate objective prevents destructively large policy updates
- Importance sampling corrections (e.g., IcePop) compensate for off-policy staleness in the experience buffer
- Reward normalization stabilizes training across different reward scales
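The two core computations can be sketched in a few lines of PyTorch; `gamma=1.0` and `lam=0.95` are typical RLHF defaults, not values mandated by this workflow:

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response's per-token rewards."""
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # regression targets for the Critic
    return advantages, returns

def ppo_actor_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, negated for gradient descent."""
    ratio = (log_probs - old_log_probs).exp()  # importance sampling ratio
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()
```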
Step 6: Synchronize weights to vLLM
After training updates, broadcast the updated Actor weights from the DeepSpeed training copy to the vLLM inference engines. This uses NCCL collective operations or CUDA IPC for fast weight transfer.
Key considerations:
- Weight sync must complete before the next rollout generation begins
- NCCL broadcast is used for multi-node setups
- CUDA IPC is faster for single-node configurations
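A sketch of the training-side half of the sync, assuming a NCCL process group (`model_update_group`) created at startup that spans training rank 0 and every vLLM engine worker, with each vLLM worker posting a matching broadcast to receive and load the tensor:

```python
import deepspeed
import torch.distributed as dist

def broadcast_actor_weights(actor_model, model_update_group):
    """Push updated Actor weights to the vLLM engines (illustrative sketch)."""
    for name, param in actor_model.named_parameters():
        # Under ZeRO-3 each parameter is sharded; gather the full tensor first.
        with deepspeed.zero.GatheredParameters([param]):
            if dist.get_rank() == 0:
                # The vLLM-side workers must call a matching broadcast for
                # this parameter and copy it into their own weight copy.
                dist.broadcast(param.data, src=0, group=model_update_group)
```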
Step 7: Save trained models
After all training episodes complete, save the final Actor model (the aligned policy) and optionally the Critic model. The Actor model is the primary output used for deployment or further refinement.
Key considerations:
- Save both Actor (policy) and Critic (value) if continuing training later
- The trained Actor can be used directly for inference or as input to iterative refinement
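A saving sketch assuming DeepSpeed ZeRO-3 with `stage3_gather_16bit_weights_on_model_save` enabled in the DS config, so the sharded weights can be consolidated into a single Hugging Face-loadable checkpoint:

```python
import os

def save_actor(engine, tokenizer, output_dir: str):
    """Consolidate and save the trained Actor (illustrative sketch)."""
    os.makedirs(output_dir, exist_ok=True)
    # Gathers the ZeRO-3 shards and writes a single bf16/fp16 state dict.
    engine.save_16bit_model(output_dir, "pytorch_model.bin")
    # Save config + tokenizer so the directory loads with from_pretrained().
    engine.module.config.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```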