Principle: OpenRLHF Reward Model Training Loop
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reward_Modeling, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A training procedure that learns a scalar reward function from human preference data by training a model to assign higher scores to preferred responses.
Description
Reward Model Training learns a scoring function that predicts human preferences. Given pairs of (chosen, rejected) responses to the same prompt, the model is trained so the chosen response receives a higher scalar reward than the rejected one. The reward model serves as a proxy for human judgment in subsequent PPO training.
The training uses a pairwise ranking loss (Bradley-Terry model) that maximizes the log-probability of the correct ranking. The trainer also tracks reward accuracy and computes reward statistics (mean, std) for normalization.
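The pairwise ranking loss and the tracked statistics described above can be sketched as follows. This is a minimal illustration, not OpenRLHF's actual trainer code; the function names `pairwise_ranking_loss` and `reward_stats` are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards, rejected_rewards, margin=None):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected [- margin])."""
    diff = chosen_rewards - rejected_rewards
    if margin is not None:
        diff = diff - margin
    # logsigmoid is the numerically stable form of log(sigmoid(x))
    return -F.logsigmoid(diff).mean()

def reward_stats(chosen_rewards, rejected_rewards):
    """Reward accuracy (fraction of pairs ranked correctly) plus the
    mean/std of all rewards, as tracked for later normalization."""
    acc = (chosen_rewards > rejected_rewards).float().mean()
    all_rewards = torch.cat([chosen_rewards, rejected_rewards])
    return acc, all_rewards.mean(), all_rewards.std()
```

Minimizing this loss pushes the scalar reward of each chosen response above that of its rejected counterpart; the accuracy metric directly measures how often that ordering already holds.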
Usage
Use after SFT when human preference data is available. The trained reward model is used for PPO training, rejection sampling scoring, and iterative DPO preference pair generation.
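For the rejection-sampling use case, the trained reward model simply scores each candidate completion and the highest-scoring one is kept. A minimal sketch, assuming a caller-supplied `reward_fn` (hypothetical; in practice this would be a forward pass through the trained reward model):

```python
def best_of_n(prompt, candidates, reward_fn):
    """Rejection sampling (best-of-n): score every candidate response
    with the reward model and return the highest-scoring one."""
    scores = [reward_fn(prompt, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx]
```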
Theoretical Basis
The Bradley-Terry model defines the probability of preferring response $y_c$ over $y_r$ given a prompt $x$:

$$P(y_c \succ y_r \mid x) = \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)$$

The loss function is the negative log-likelihood of the observed preferences:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\Big[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\Big]$$
OpenRLHF supports two loss variants:
- PairWiseLoss (sigmoid): standard Bradley-Terry with an optional margin $m$, i.e. $-\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r) - m\big)$
- LogExpLoss: $\log\big(1 + \exp(r_\theta(x, y_r) - r_\theta(x, y_c))\big)$, which is mathematically equivalent to PairWiseLoss without a margin
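The equivalence of the two variants follows from the identity $\log(1 + e^{-x}) = -\log \sigma(x)$. A small sketch (illustrative, not OpenRLHF's source; `softplus` is the numerically stable form of $\log(1 + e^{x})$):

```python
import torch
import torch.nn.functional as F

def pairwise_loss(chosen, rejected, margin=None):
    # PairWiseLoss: -log sigmoid(r_c - r_r - margin)
    diff = chosen - rejected
    if margin is not None:
        diff = diff - margin
    return -F.logsigmoid(diff).mean()

def log_exp_loss(chosen, rejected):
    # LogExpLoss: log(1 + exp(r_r - r_c)), computed via softplus for stability
    return F.softplus(rejected - chosen).mean()
```

With no margin the two losses produce identical values and gradients, so the choice between them is a matter of whether margin support is needed.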