Principle:OpenRLHF OpenRLHF Pairwise Ranking Loss

Knowledge Sources	Training language models to follow instructions with human feedback; Scaling Laws for Reward Model Overoptimization
Domains	Reward_Modeling, Loss_Functions
Last Updated	2026-02-07 00:00 GMT

Overview

A loss function based on the Bradley-Terry model that trains reward models to assign higher scores to human-preferred responses.

Description

Pairwise Ranking Loss implements the Bradley-Terry preference model for reward model training. Given a pair of responses where one is preferred (chosen) and one is not (rejected), the loss pushes the reward model to produce a higher scalar score for the chosen response. OpenRLHF provides two variants: PairWiseLoss (log-sigmoid) and LogExpLoss (log-exponential).

Usage

Used internally by RewardModelTrainer. The loss is chosen by name: "sigmoid" selects the standard PairWiseLoss and "logexp" selects the LogExpLoss variant.

Theoretical Basis

PairWiseLoss (sigmoid): $L = -\log \sigma(r_{\text{chosen}} - r_{\text{rejected}} - m)$ where $m$ is an optional margin.

LogExpLoss: $L = \log(1 + \exp(r_{\text{rejected}} - r_{\text{chosen}}))$
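
The two formulas can be checked numerically with a small scalar sketch. This is not the OpenRLHF implementation (which operates on batched tensors); the function names `pairwise_sigmoid_loss` and `log_exp_loss` are hypothetical and chosen only for this illustration. It uses overflow-safe forms of log-sigmoid and softplus from the Python standard library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for a scalar."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Hypothetical scalar version of -log sigmoid(r_chosen - r_rejected - margin)."""
    return -log_sigmoid(r_chosen - r_rejected - margin)


def log_exp_loss(r_chosen: float, r_rejected: float) -> float:
    """Hypothetical scalar version of log(1 + exp(r_rejected - r_chosen))."""
    x = r_rejected - r_chosen
    # Stable softplus: log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|}).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Print both losses for a few score gaps (chosen minus rejected).
    for gap in (-2.0, 0.0, 2.0):
        print(gap, pairwise_sigmoid_loss(gap, 0.0), log_exp_loss(gap, 0.0))
```

In this sketch a positive `margin` raises the loss for a given score gap, so the chosen response has to beat the rejected one by more than `margin` to reach the same loss value.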

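Both losses decrease monotonically as the gap $r_{\text{chosen}} - r_{\text{rejected}}$ grows. They are strictly positive for any finite gap and approach zero only as the gap tends to infinity, so neither has a finite minimizer; at a gap of zero (with $m = 0$) both equal $\log 2$. With $m = 0$ the two expressions are algebraically identical, since $-\log \sigma(x) = \log(1 + e^{-x})$.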
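
To see how the loss pushes the two scores apart, the sketch below computes the gradient of the sigmoid form with respect to each score and takes one gradient-descent step. It treats the two scores as free parameters, which is an assumption made purely for illustration; in a real reward model the gradient flows through the network weights. The names `pair_loss_grads` and `sgd_step` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid for a scalar."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pair_loss_grads(r_chosen: float, r_rejected: float, margin: float = 0.0):
    """Gradients of -log sigmoid(r_chosen - r_rejected - margin) w.r.t. both scores."""
    # d/dx [-log sigmoid(x)] = -(1 - sigmoid(x)) = -sigmoid(-x)
    g_chosen = -sigmoid(-(r_chosen - r_rejected - margin))
    # The rejected score enters with the opposite sign, so its gradient is the negation.
    return g_chosen, -g_chosen


def sgd_step(r_chosen: float, r_rejected: float, lr: float = 0.1):
    """One gradient-descent step on the two scores (illustrative assumption only)."""
    g_chosen, g_rejected = pair_loss_grads(r_chosen, r_rejected)
    return r_chosen - lr * g_chosen, r_rejected - lr * g_rejected


if __name__ == "__main__":
    # Starting from a mis-ranked pair, repeated steps widen the gap in the right direction.
    rc, rr = 0.0, 1.0
    for _ in range(5):
        rc, rr = sgd_step(rc, rr)
        print(rc - rr)
```

The gradient magnitude is $\sigma(-(r_{\text{chosen}} - r_{\text{rejected}} - m))$, so pairs that are ranked wrongly receive large updates while pairs that are already well separated receive updates close to zero.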

Related Pages

Implemented By

Implementation:OpenRLHF_OpenRLHF_PairWiseLoss
