
Principle: OpenRLHF Reward Model Training Loop

From Leeroopedia


Knowledge Sources
Domains NLP, Reward_Modeling, Training
Last Updated 2026-02-07 00:00 GMT

Overview

A training procedure that learns a scalar reward function from human preference data, optimizing the model to assign higher scores to preferred responses.

Description

Reward Model Training learns a scoring function that predicts human preferences. Given pairs of (chosen, rejected) responses to the same prompt, the model is trained so the chosen response receives a higher scalar reward than the rejected one. The reward model serves as a proxy for human judgment in subsequent PPO training.

The training uses a pairwise ranking loss (Bradley-Terry model) that maximizes the log-probability of the correct ranking. The trainer also tracks reward accuracy and computes reward statistics (mean, std) for normalization.
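The training step described above can be sketched as follows. This is a minimal illustration, not OpenRLHF's actual trainer code: the function name `reward_training_step` and the assumed `reward_model` interface (token ids in, one scalar reward per sequence out) are hypothetical.

```python
import torch
import torch.nn.functional as F

def reward_training_step(reward_model, chosen_ids, rejected_ids, margin=None):
    """One pairwise reward-model update step (sketch).

    reward_model is assumed to map a batch of token-id tensors
    to a scalar reward per sequence, shape (batch,).
    """
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)

    diff = r_chosen - r_rejected
    if margin is not None:
        diff = diff - margin

    # Bradley-Terry negative log-likelihood: -log sigma(r_chosen - r_rejected)
    loss = -F.logsigmoid(diff).mean()

    # Tracked metrics: ranking accuracy and reward statistics for normalization
    accuracy = (r_chosen > r_rejected).float().mean()
    all_rewards = torch.cat([r_chosen, r_rejected])
    reward_mean = all_rewards.mean()
    reward_std = all_rewards.std()
    return loss, accuracy, reward_mean, reward_std
```

In a full loop, `loss.backward()` and an optimizer step would follow, with the accuracy and reward statistics logged each step; the mean and std can later be used to normalize rewards during PPO.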

Usage

Use after SFT when human preference data is available. The trained reward model is used for PPO training, rejection sampling scoring, and iterative DPO preference pair generation.

Theoretical Basis

The Bradley-Terry model defines the probability of preferring response $y_w$ over $y_l$:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

The loss function is the negative log-likelihood:

$$\mathcal{L} = -\mathbb{E}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$

OpenRLHF supports two loss variants:

  • PairWiseLoss (sigmoid): Standard Bradley-Terry with optional margin
  • LogExpLoss: $\mathcal{L} = \log\big(1 + \exp(r_{\text{reject}} - r_{\text{chosen}})\big)$
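A minimal PyTorch sketch of the two variants is shown below. The class names match those used in OpenRLHF, but the exact signatures in the library may differ; treat this as an illustration of the formulas rather than the library's implementation. Note that without a margin the two losses are mathematically identical, since $-\log \sigma(d) = \log(1 + e^{-d})$.

```python
import torch
import torch.nn.functional as F

class PairWiseLoss(torch.nn.Module):
    """Bradley-Terry sigmoid loss with an optional per-pair margin."""
    def forward(self, chosen_reward, reject_reward, margin=None):
        diff = chosen_reward - reject_reward
        if margin is not None:
            diff = diff - margin
        return -F.logsigmoid(diff).mean()

class LogExpLoss(torch.nn.Module):
    """log(1 + exp(r_reject - r_chosen)); equivalent to the margin-free case."""
    def forward(self, chosen_reward, reject_reward):
        return torch.log1p(torch.exp(reject_reward - chosen_reward)).mean()
```

`torch.log1p` is used for the log-exp form to stay numerically stable when the reward gap is large.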
