Implementation:OpenRLHF OpenRLHF PPOTrainer fit
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool, provided by OpenRLHF, for orchestrating the full PPO-RLHF training loop across distributed Ray actors.
Description
The PPOTrainer.fit() method implements the outer PPO training loop. For each batch of prompts, it: (1) generates responses via vLLM engines, (2) computes rewards and KL penalties via reference/reward models, (3) runs GAE advantage estimation, (4) trains the actor using PolicyLoss for multiple epochs, (5) trains the critic using ValueLoss, and (6) broadcasts updated weights to vLLM engines. All model operations are distributed across Ray actors.
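Step (3) above applies Generalized Advantage Estimation. As a minimal standalone sketch of the GAE recurrence (illustrative only, not OpenRLHF's actual implementation):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Compute GAE advantages and returns for one trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value  # bootstrap value after the final step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    # Returns are the critic's regression targets: R_t = A_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma = lam = 1.0` this reduces to Monte-Carlo advantages, which makes the recurrence easy to check by hand.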
Usage
Called after PPOTrainer initialization with all Ray actor groups configured. Runs until all prompts are exhausted.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/trainer/ppo_trainer.py
Signature
class PPOTrainer:
    def fit(
        self,
        args,
        consumed_samples: int = 0,
        num_update_steps_per_episode: int = None,
    ) -> None:
        """
        Run the full PPO training loop.

        For each batch:
        1. Generate responses from prompts via vLLM
        2. Score with reward model
        3. Compute KL, advantages, returns
        4. Train actor (PolicyLoss) for ppo_epochs
        5. Train critic (ValueLoss) for ppo_epochs
        6. Broadcast weights to vLLM
        """
Import
from openrlhf.trainer import PPOTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Namespace | Yes | Full PPO training configuration |
| consumed_samples | int | No | Resume from checkpoint offset |
| num_update_steps_per_episode | int | No | Number of PPO update steps per episode |
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Actor and critic updated in-place |
| logs | Dict | PPO metrics logged to W&B/TensorBoard |
| checkpoints | Files | Model weights saved periodically |
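Step (2) of the loop combines the reward model's scalar score with a per-token KL penalty against the reference model. A hedged sketch of the common RLHF reward-shaping scheme (the exact shaping in OpenRLHF may differ, e.g. in clamping or KL estimator choice):

```python
def shaped_rewards(actor_logps, ref_logps, rm_score, kl_coef=0.05):
    """Per-token KL penalty, with the scalar reward-model score
    added at the final response token:

    r_t = -kl_coef * (log pi(a_t) - log pi_ref(a_t)),
    r_T += rm_score
    """
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards
```

If the actor matches the reference exactly, the KL penalty vanishes and only the final token carries the reward-model score.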
Usage Examples
from openrlhf.trainer import PPOTrainer

# All Ray actor groups and vLLM engines must be created beforehand.
ppo_trainer = PPOTrainer(
actor_model_group=actor_group,
critic_model_group=critic_group,
reward_model_group=reward_group,
ref_model_group=ref_group,
vllm_engines=vllm_engines,
strategy=strategy,
args=args,
)
ppo_trainer.fit(args)
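The critic update in step (5) typically uses PPO's clipped value loss. A minimal sketch of the standard form (OpenRLHF's ValueLoss may differ in details such as masking and reduction):

```python
def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """Clipped value regression:

    L = 0.5 * mean( max( (V - R)^2, (V_clip - R)^2 ) )
    where V_clip = V_old + clip(V - V_old, -eps, +eps).
    """
    total = 0.0
    for v, old_v, ret in zip(values, old_values, returns):
        v_clipped = old_v + max(min(v - old_v, clip_eps), -clip_eps)
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
    return 0.5 * total / len(values)
```

The clipping keeps the critic from moving too far from its pre-update predictions within one PPO epoch, mirroring the actor-side ratio clipping.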
Related Pages
Implements Principle
Requires Environment
- Environment:OpenRLHF_OpenRLHF_CUDA_GPU_Environment
- Environment:OpenRLHF_OpenRLHF_vLLM_Environment
- Environment:OpenRLHF_OpenRLHF_Ray_Distributed_Environment