
Principle: OpenRLHF PPO Training Loop

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Training
Last Updated: 2026-02-07 00:00 GMT

Overview

A multi-model training orchestrator that coordinates on-policy generation, reward scoring, advantage estimation, and policy/value function updates in the PPO-RLHF loop.

Description

PPO Training Loop orchestrates the complex interaction between multiple models in RLHF:

  1. Generation: vLLM generates responses from prompts using the current policy
  2. Scoring: Reference model and reward model score the generated responses
  3. Experience Making: KL penalties, advantages (GAE), and returns are computed
  4. Training: Actor (policy) and Critic (value function) are updated using PPO objectives
  5. Weight Sync: Updated policy weights are broadcast to vLLM engines

This cycle repeats for each batch of prompts.
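The five steps above can be sketched as one toy iteration. This is an illustrative outline only, not OpenRLHF's actual API: the function name, the dummy generation/scoring logic, and the scalar "weights" are all stand-ins for the real vLLM, reward-model, and distributed-training components.

```python
def run_ppo_iteration(prompts, policy_weights):
    """One cycle of the PPO-RLHF loop (hypothetical sketch, not OpenRLHF's API)."""
    # 1. Generation: the current policy produces a response per prompt
    #    (a dummy string here; in OpenRLHF this is a vLLM call).
    responses = [f"response-to:{p}" for p in prompts]
    # 2. Scoring: reward-model scores and reference-model log-probs
    #    (constants here; real scoring runs the reward/reference models).
    rewards = [float(len(r) % 5) for r in responses]
    ref_logprobs = [-1.0 for _ in responses]
    # 3. Experience making: KL-penalized rewards, then advantages
    #    (GAE collapses to reward minus a baseline in this one-step toy).
    kl_coef = 0.1
    shaped = [r - kl_coef * (-lp) for r, lp in zip(rewards, ref_logprobs)]
    baseline = sum(shaped) / len(shaped)
    advantages = [s - baseline for s in shaped]
    # 4. Training: a stand-in gradient step on the (scalar) policy weights.
    lr = 0.01
    policy_weights = policy_weights + lr * sum(advantages)
    # 5. Weight sync: hand the updated weights back to the generation engines
    #    (a no-op here; OpenRLHF broadcasts to the vLLM workers).
    return policy_weights, advantages
```

Centering the shaped rewards on their batch mean mirrors the common practice of whitening advantages before the PPO update, which keeps the toy gradient step scale-free.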

Usage

Use for PPO-based RLHF training with a trained reward model, or for GRPO with rule-based rewards (no critic). Requires a Ray cluster with multiple GPU groups.

Theoretical Basis

PPO-RLHF combines:

  • On-policy generation: Fresh samples from current policy
  • GAE (Generalized Advantage Estimation): A_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t)
  • Clipped policy gradient: Conservative actor updates
  • Clipped value function: Stable critic updates
  • KL penalty: Prevents excessive divergence from reference
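Two of these ingredients are compact enough to state directly in code. The sketch below implements the GAE recursion from the formula above and the standard PPO clipped surrogate for a single token; it follows the textbook definitions, not OpenRLHF's internal implementation.

```python
import math

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_{l>=0} (gamma*lam)^l * delta_{t+l},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    Computed by the backward recursion A_t = delta_t + gamma*lam*A_{t+1}."""
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        # Terminal state is assumed to have value 0.
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

def clipped_policy_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one token: -min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Taking the min of the two surrogates makes the update conservative:
    # the objective stops improving once the ratio leaves [1-eps, 1+eps].
    return -min(unclipped, clipped)
```

The clipped value loss and KL penalty follow the same pattern: clamp the new value prediction around the old one, and subtract a per-token KL estimate (scaled by a coefficient) from the reward before computing deltas.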
