Principle:ContextualAI HALOs Bradley Terry Reward Modeling
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A preference learning method that trains a neural network to predict which of two responses a human would prefer, using the Bradley-Terry probabilistic model of pairwise comparisons.
Description
The Bradley-Terry model is a classic statistical model for pairwise comparison data. In the context of LLM alignment, it is used to train a reward model that assigns scalar scores to model outputs. Given a pair of responses (chosen, rejected) to the same prompt, the model learns to assign a higher score to the chosen response.
The resulting reward model serves as a proxy for human judgment and is used in two ways:
- As the reward signal in PPO training
- As the labeling function in online iterative alignment (scoring model completions for DPO/KTO feedback construction)
Unlike DPO, which implicitly models preferences through the language model itself, the Bradley-Terry approach trains a separate model dedicated to scoring, allowing it to be reused across multiple training rounds and methods.
Usage
Train a Bradley-Terry reward model when you need a reusable reward scorer for online iterative alignment or PPO training. Requires paired preference data (response A preferred over response B).
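To make the training procedure concrete, here is a minimal sketch that fits a linear reward model on paired preference data by gradient descent on the Bradley-Terry loss. This is an illustration, not a production recipe: the linear featurization, the function names, and the hyperparameters are all assumptions for the sketch (a real reward model would be a neural network scoring prompt-response pairs).

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_bt_reward_model(pairs, dim, lr=0.1, epochs=100, seed=0):
    """Fit a linear reward model r(x) = w . x on a list of
    (chosen_features, rejected_features) pairs by gradient descent
    on the Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    `pairs`, `dim`, etc. are hypothetical names for this sketch.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.01) for _ in range(dim)]
    for _ in range(epochs):
        for x_c, x_r in pairs:
            # Margin between chosen and rejected scores under current weights.
            margin = sum(wi * (c - r) for wi, c, r in zip(w, x_c, x_r))
            # d/d(margin) of -log sigmoid(margin) is sigmoid(margin) - 1.
            grad_coef = sigmoid(margin) - 1.0
            for i in range(dim):
                w[i] -= lr * grad_coef * (x_c[i] - x_r[i])
    return w
```

After training, the learned weights should assign higher scores to the chosen side of each pair; in practice the same loss is applied to the scalar head of a transformer rather than a linear model.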
Theoretical Basis
The Bradley-Terry model defines the probability that response A is preferred over response B:

$$P(y_A \succ y_B \mid x) = \sigma\big(r_\theta(x, y_A) - r_\theta(x, y_B)\big)$$

where $r_\theta$ is the reward model and $\sigma$ is the sigmoid function.
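Numerically, the preference probability is just the sigmoid of the score difference. A minimal sketch (the function name is ours):

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B,
    given scalar reward scores r_a and r_b: sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Equal scores yield a probability of 0.5; the larger the score gap, the more confident the predicted preference.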
The training loss is the binary cross-entropy:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$

where $y_w$ is the preferred (chosen) response and $y_l$ is the dispreferred (rejected) response.
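The per-pair loss is the negative log-sigmoid of the score margin. A minimal numerically stable sketch (naming is ours; frameworks typically provide a fused log-sigmoid):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected), computed stably via log1p."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        # -log sigmoid(m) = log(1 + exp(-m))
        return math.log1p(math.exp(-margin))
    # For m < 0: log(1 + exp(-m)) = -m + log(1 + exp(m)), avoiding overflow.
    return -margin + math.log1p(math.exp(margin))
```

When the two scores are equal the loss is log 2; it shrinks toward zero as the chosen response's score pulls ahead.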
The key metric for evaluating a reward model is reward accuracy: the fraction of held-out preference pairs where the model assigns a higher score to the chosen response.