Principle: OpenRLHF PPO Training Loop
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A multi-model training orchestrator that coordinates on-policy generation, reward scoring, advantage estimation, and policy/value function updates in the PPO-RLHF loop.
Description
PPO Training Loop orchestrates the complex interaction between multiple models in RLHF:
- Generation: vLLM generates responses from prompts using the current policy
- Scoring: Reference model and reward model score the generated responses
- Experience Making: KL penalties, advantages (GAE), and returns are computed
- Training: Actor (policy) and Critic (value function) are updated using PPO objectives
- Weight Sync: Updated policy weights are broadcast to vLLM engines
This cycle repeats for each batch of prompts.
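The five steps above can be sketched as a toy loop. Every function below is a hypothetical stub standing in for an OpenRLHF component, not the library's actual API; in practice each step runs on its own Ray-managed GPU group, with vLLM handling generation.

```python
def vllm_generate(prompts):
    """Generation stub: fresh on-policy samples from the current policy."""
    return [p + " <response>" for p in prompts]

def reward_model_score(responses):
    """Scoring stub: one scalar reward per full response."""
    return [1.0 for _ in responses]

def make_experience(rewards, logp_actor, logp_ref, beta=0.05):
    """KL-shaped rewards: subtract beta * (log pi_actor - log pi_ref)."""
    return [r - beta * (a - f)
            for r, a, f in zip(rewards, logp_actor, logp_ref)]

def ppo_update(experience):
    """Training stub: one actor/critic update over the experience batch."""
    return {"num_samples": len(experience)}

def broadcast_weights():
    """Weight-sync stub: push updated actor weights to the vLLM engines."""
    pass

def training_iteration(prompts, logp_actor, logp_ref):
    responses = vllm_generate(prompts)               # 1. generation
    rewards = reward_model_score(responses)          # 2. scoring
    experience = make_experience(rewards,            # 3. experience making
                                 logp_actor, logp_ref)
    stats = ppo_update(experience)                   # 4. training
    broadcast_weights()                              # 5. weight sync
    return stats

stats = training_iteration(["Q1", "Q2"], [-1.2, -0.8], [-1.0, -1.0])
```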
Usage
Use for PPO-based RLHF training with a trained reward model, or for GRPO with rule-based rewards (no critic). Requires a Ray cluster with multiple GPU groups.
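For the GRPO variant mentioned above, the critic is replaced by group-relative advantages: each prompt is sampled several times, and each response's reward is normalized against its group's mean and standard deviation. A minimal sketch (the function name is illustrative, not OpenRLHF's API):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages for all samples of one prompt:
    (reward - group mean) / group std. No value network required."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four sampled responses to the same prompt, scored by a rule-based reward.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the advantages are zero-mean within each group, above-average responses are reinforced and below-average ones are penalized without any learned baseline.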
Theoretical Basis
PPO-RLHF combines:
- On-policy generation: Fresh samples from current policy
- GAE (Generalized Advantage Estimation): Exponentially weighted advantage estimates trading off bias and variance
- Clipped policy gradient: Conservative actor updates
- Clipped value function: Stable critic updates
- KL penalty: Prevents excessive divergence from reference
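Two of these pieces, GAE and the clipped policy gradient, can be sketched in plain Python. This is an illustrative single-trajectory version, not OpenRLHF's implementation:

```python
import math

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.
    `values` holds len(rewards) + 1 entries (bootstrap value appended)."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    # Returns = advantage + value, used as critic regression targets.
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns

def clipped_policy_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one token (averaged over a batch in practice)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped * advantage)

advs, rets = gae([1.0, 1.0], [0.0, 0.0, 0.0])
```

The clip keeps each actor update conservative: once the probability ratio leaves [1 - eps, 1 + eps], the gradient through the clipped term vanishes.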