Implementation:OpenRLHF OpenRLHF PPOTrainer fit

Knowledge Sources	OpenRLHF
Domains	Reinforcement_Learning, Training
Last Updated	2026-02-07 00:00 GMT

Overview

Concrete tool, provided by OpenRLHF, for orchestrating the full PPO-RLHF training loop across distributed Ray actors.

Description

The PPOTrainer.fit() method implements the outer PPO training loop. For each batch of prompts, it: (1) generates responses via vLLM engines, (2) computes rewards and KL penalties via reference/reward models, (3) runs GAE advantage estimation, (4) trains the actor using PolicyLoss for multiple epochs, (5) trains the critic using ValueLoss, and (6) broadcasts updated weights to vLLM engines. All model operations are distributed across Ray actors.
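
Steps (2) and (3) can be illustrated with a small numerical sketch. The code below is not taken from OpenRLHF: the function names, the log-ratio KL estimate, and the convention of adding the reward-model score on the final token are assumptions chosen to show the general technique in plain Python.

```python
from typing import List, Tuple


def kl_shaped_rewards(
    sequence_reward: float,
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    kl_coef: float,
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, plus the scalar
    reward-model score added on the final token (assumed convention)."""
    rewards = [
        -kl_coef * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += sequence_reward
    return rewards


def gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over the tokens of one response.

    values[t] is the critic's estimate at token t; the value after the
    final token is taken to be 0 (end of the response).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum after it.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    # Returns are the regression targets for the critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

The returned advantages feed the actor update and the returns feed the critic update, which is why both are computed before either model is trained.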

Usage

Call fit() after the PPOTrainer has been initialized and all Ray actor groups are configured. It runs until all prompts are exhausted.
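
The control flow of such a loop can be sketched schematically. Everything in this block is hypothetical: run_ppo_loop, the phases dictionary, and its keys are illustrative stand-ins, and the phases run sequentially in one process, whereas the real trainer dispatches work to Ray actors.

```python
from typing import Callable, Dict, Iterable, List


def run_ppo_loop(
    prompt_batches: Iterable[List[str]],
    phases: Dict[str, Callable],
    ppo_epochs: int = 1,
) -> int:
    """Run one PPO step per prompt batch until the batches are exhausted.

    Returns the number of steps that were executed.
    """
    steps = 0
    for prompts in prompt_batches:
        responses = phases["generate"](prompts)        # rollout
        scored = phases["score"](prompts, responses)   # rewards and KL
        batch = phases["advantages"](scored)           # advantages, returns
        for _ in range(ppo_epochs):
            phases["train_actor"](batch)
        for _ in range(ppo_epochs):
            phases["train_critic"](batch)
        phases["broadcast"]()                          # sync rollout weights
        steps += 1
    return steps
```

The loop ends when the prompt iterator is empty, so the amount of training is determined by the number of prompt batches rather than by a separate stop condition.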

Code Reference

Source Location

Repository: OpenRLHF
File: openrlhf/trainer/ppo_trainer.py

Signature

class PPOTrainer:
    def fit(
        self,
        args,
        consumed_samples: int = 0,
        num_update_steps_per_episode: int = None,
    ) -> None:
        """
        Run the full PPO training loop.

        For each batch:
        1. Generate responses from prompts via vLLM
        2. Score with reward model
        3. Compute KL, advantages, returns
        4. Train actor (PolicyLoss) for ppo_epochs
        5. Train critic (ValueLoss) for ppo_epochs
        6. Broadcast weights to vLLM
        """
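
For steps 4 and 5, the standard PPO clipped objectives can be written out per token as follows. This is a sketch of the textbook formulation, not the bodies of OpenRLHF's PolicyLoss and ValueLoss; the function names, the clip_eps default, the 0.5 factor, and the absence of token masking are all assumptions.

```python
import math
from typing import List


def clipped_policy_loss(
    logprobs: List[float],
    old_logprobs: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean PPO clipped surrogate loss over the tokens of one response."""
    losses = []
    for lp, old_lp, adv in zip(logprobs, old_logprobs, advantages):
        ratio = math.exp(lp - old_lp)  # pi_new / pi_old for this token
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the two surrogates, then negate
        # because optimizers minimize.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


def clipped_value_loss(
    values: List[float],
    old_values: List[float],
    returns: List[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped squared error between critic values and returns."""
    losses = []
    for v, old_v, ret in zip(values, old_values, returns):
        # Limit how far the new value may move from the rollout-time value.
        v_clipped = old_v + min(max(v - old_v, -clip_eps), clip_eps)
        losses.append(0.5 * max((v - ret) ** 2, (v_clipped - ret) ** 2))
    return sum(losses) / len(losses)
```

Clipping is what makes it reasonable to reuse one rollout batch for several epochs: once the policy ratio or the value estimate drifts past the clip range, further movement in that direction stops reducing the loss.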

Import

from openrlhf.trainer import PPOTrainer

I/O Contract

Inputs

Name	Type	Required	Description
args	Namespace	Yes	Full PPO training configuration
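consumed_samples	int	No	Number of samples already consumed; used as the offset when resuming from a checkpoint (default 0)
num_update_steps_per_episode	int	No	Update steps per episode (default None)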
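
One plausible way to interpret a consumed-sample offset when resuming is to convert it into an episode index and a number of rollout batches to skip. The helper below is hypothetical: resume_position, rollout_batch_size, and rollouts_per_episode are illustrative names, and the arithmetic is an assumption rather than a description of what fit() does internally.

```python
from typing import Tuple


def resume_position(
    consumed_samples: int,
    rollout_batch_size: int,
    rollouts_per_episode: int,
) -> Tuple[int, int]:
    """Map a global sample offset to (episode index, rollouts to skip)."""
    # Whole rollout batches that were finished before the checkpoint.
    rollouts_done = consumed_samples // rollout_batch_size
    start_episode = rollouts_done // rollouts_per_episode
    skip_in_episode = rollouts_done % rollouts_per_episode
    return start_episode, skip_in_episode
```

With a rollout batch of 128 prompts and 10 rollouts per episode, an offset of 1536 samples corresponds to 12 finished rollouts, so training would resume in the second episode after skipping its first 2 rollouts.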

Outputs

Name	Type	Description
(side effect)	None	Actor and critic updated in-place
logs	Dict	PPO metrics logged to W&B/TensorBoard
checkpoints	Files	Model weights saved periodically

Usage Examples

from openrlhf.trainer import PPOTrainer

ppo_trainer = PPOTrainer(
    actor_model_group=actor_group,
    critic_model_group=critic_group,
    reward_model_group=reward_group,
    ref_model_group=ref_group,
    vllm_engines=vllm_engines,
    strategy=strategy,
    args=args,
)

ppo_trainer.fit(args)

Related Pages

Implements Principle

Principle:OpenRLHF_OpenRLHF_PPO_Training_Loop
