Principle:ContextualAI HALOs Bradley Terry Reward Modeling
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A preference learning method that trains a neural network to predict which of two responses a human would prefer, using the Bradley-Terry probabilistic model of pairwise comparisons.
Description
The Bradley-Terry model is a classic statistical model for pairwise comparison data. In the context of LLM alignment, it is used to train a reward model that assigns scalar scores to model outputs. Given a pair of responses (chosen, rejected) to the same prompt, the model learns to assign a higher score to the chosen response.
The resulting reward model serves as a proxy for human judgment and is used in two ways:
- As the reward signal in PPO training
- As the labeling function in online iterative alignment (scoring model completions for DPO/KTO feedback construction)
Unlike DPO, which implicitly models preferences through the language model itself, the Bradley-Terry approach trains a separate model dedicated to scoring, allowing it to be reused across multiple training rounds and methods.
Usage
Train a Bradley-Terry reward model when you need a reusable reward scorer for online iterative alignment or PPO training. Requires paired preference data (response A preferred over response B).
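To make the training procedure concrete, here is a minimal sketch that fits a linear reward model on paired preference data by gradient descent on the Bradley-Terry loss. This is an illustration, not a production recipe: the linear featurization, the function names, and the hyperparameters are all assumptions for the sketch (a real reward model would be a neural network scoring prompt-response pairs).

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_bt_reward_model(pairs, dim, lr=0.1, epochs=100, seed=0):
    """Fit a linear reward model r(x) = w . x on a list of
    (chosen_features, rejected_features) pairs by gradient descent
    on the Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    `pairs`, `dim`, etc. are hypothetical names for this sketch.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.01) for _ in range(dim)]
    for _ in range(epochs):
        for x_c, x_r in pairs:
            # Margin between chosen and rejected scores under current weights.
            margin = sum(wi * (c - r) for wi, c, r in zip(w, x_c, x_r))
            # d/d(margin) of -log sigmoid(margin) is sigmoid(margin) - 1.
            grad_coef = sigmoid(margin) - 1.0
            for i in range(dim):
                w[i] -= lr * grad_coef * (x_c[i] - x_r[i])
    return w
```

After training, the learned weights should assign higher scores to the chosen side of each pair; in practice the same loss is applied to the scalar head of a transformer rather than a linear model.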
Theoretical Basis
The Bradley-Terry model defines the probability that response A is preferred over response B:

$$P(y_A \succ y_B \mid x) = \sigma\big(r_\theta(x, y_A) - r_\theta(x, y_B)\big)$$

where $r_\theta$ is the reward model and $\sigma$ is the sigmoid function.
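Numerically, the preference probability is just the sigmoid of the score difference. A minimal sketch (the function name is ours):

```python
import math

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B,
    given scalar reward scores r_a and r_b: sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))
```

Equal scores yield a probability of 0.5; the larger the score gap, the more confident the predicted preference.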
The training loss is the binary cross-entropy:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$

where $y_w$ is the preferred (chosen) response and $y_l$ is the dispreferred (rejected) response.
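The per-pair loss is the negative log-sigmoid of the score margin. A minimal numerically stable sketch (naming is ours; frameworks typically provide a fused log-sigmoid):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected), computed stably via log1p."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        # -log sigmoid(m) = log(1 + exp(-m))
        return math.log1p(math.exp(-margin))
    # For m < 0: log(1 + exp(-m)) = -m + log(1 + exp(m)), avoiding overflow.
    return -margin + math.log1p(math.exp(margin))
```

When the two scores are equal the loss is log 2; it shrinks toward zero as the chosen response's score pulls ahead.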
The key metric for evaluating a reward model is reward accuracy: the fraction of held-out preference pairs where the model assigns a higher score to the chosen response.