
Principle: OpenRLHF Reward Model Training Loop

From Leeroopedia


Knowledge Sources
Domains NLP, Reward_Modeling, Training
Last Updated 2026-02-07 00:00 GMT

Overview

A training procedure that learns a scalar reward function from human preference data, optimizing the model to assign higher scores to preferred responses.

Description

Reward Model Training learns a scoring function that predicts human preferences. Given pairs of (chosen, rejected) responses to the same prompt, the model is trained so the chosen response receives a higher scalar reward than the rejected one. The reward model serves as a proxy for human judgment in subsequent PPO training.

The training uses a pairwise ranking loss (Bradley-Terry model) that maximizes the log-probability of the correct ranking. The trainer also tracks reward accuracy and computes reward statistics (mean, std) for normalization.
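The training step described above can be sketched as follows. This is a minimal illustration, not OpenRLHF's actual trainer code: the function name `reward_training_step` and the assumed `reward_model` interface (token ids in, one scalar reward per sequence out) are hypothetical.

```python
import torch
import torch.nn.functional as F

def reward_training_step(reward_model, chosen_ids, rejected_ids, margin=None):
    """One pairwise reward-model update step (sketch).

    reward_model is assumed to map a batch of token-id tensors
    to a scalar reward per sequence, shape (batch,).
    """
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)

    diff = r_chosen - r_rejected
    if margin is not None:
        diff = diff - margin

    # Bradley-Terry negative log-likelihood: -log sigma(r_chosen - r_rejected)
    loss = -F.logsigmoid(diff).mean()

    # Tracked metrics: ranking accuracy and reward statistics for normalization
    accuracy = (r_chosen > r_rejected).float().mean()
    all_rewards = torch.cat([r_chosen, r_rejected])
    reward_mean = all_rewards.mean()
    reward_std = all_rewards.std()
    return loss, accuracy, reward_mean, reward_std
```

In a full loop, `loss.backward()` and an optimizer step would follow, with the accuracy and reward statistics logged each step; the mean and std can later be used to normalize rewards during PPO.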

Usage

Use after SFT when human preference data is available. The trained reward model is used for PPO training, rejection sampling scoring, and iterative DPO preference pair generation.

Theoretical Basis

The Bradley-Terry model defines the probability of preferring response $y_w$ over $y_l$:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

The loss function is the negative log-likelihood:

$$\mathcal{L} = -\mathbb{E}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$

OpenRLHF supports two loss variants:

  • PairWiseLoss (sigmoid): Standard Bradley-Terry with optional margin
  • LogExpLoss: $\mathcal{L} = \log\big(1 + \exp(r_{\text{reject}} - r_{\text{chosen}})\big)$
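A minimal PyTorch sketch of the two variants is shown below. The class names match those used in OpenRLHF, but the exact signatures in the library may differ; treat this as an illustration of the formulas rather than the library's implementation. Note that without a margin the two losses are mathematically identical, since $-\log \sigma(d) = \log(1 + e^{-d})$.

```python
import torch
import torch.nn.functional as F

class PairWiseLoss(torch.nn.Module):
    """Bradley-Terry sigmoid loss with an optional per-pair margin."""
    def forward(self, chosen_reward, reject_reward, margin=None):
        diff = chosen_reward - reject_reward
        if margin is not None:
            diff = diff - margin
        return -F.logsigmoid(diff).mean()

class LogExpLoss(torch.nn.Module):
    """log(1 + exp(r_reject - r_chosen)); equivalent to the margin-free case."""
    def forward(self, chosen_reward, reject_reward):
        return torch.log1p(torch.exp(reject_reward - chosen_reward)).mean()
```

`torch.log1p` is used for the log-exp form to stay numerically stable when the reward gap is large.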
