
Principle:ContextualAI HALOs Bradley Terry Reward Modeling

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

A preference learning method that trains a neural network to predict which of two responses a human would prefer, using the Bradley-Terry probabilistic model of pairwise comparisons.

Description

The Bradley-Terry model is a classic statistical model for pairwise comparison data. In the context of LLM alignment, it is used to train a reward model that assigns scalar scores to model outputs. Given a pair of responses (chosen, rejected) to the same prompt, the model learns to assign a higher score to the chosen response.

The resulting reward model serves as a proxy for human judgment and is used in two ways:

  1. As the reward signal in PPO training
  2. As the labeling function in online iterative alignment (scoring model completions for DPO/KTO feedback construction)

Unlike DPO, which models preferences implicitly through the language model itself, the Bradley-Terry approach trains a separate model dedicated to scoring, allowing it to be reused across multiple training rounds and methods.

Usage

Train a Bradley-Terry reward model when you need a reusable reward scorer for online iterative alignment or PPO training. Training requires paired preference data (response A preferred over response B for the same prompt).
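The training procedure can be sketched end-to-end on toy data. Everything below (a linear reward over fixed feature vectors, the learning rate, the sampling scheme) is an illustrative assumption for the sketch, not a prescribed recipe; in practice the reward model is a neural network over (prompt, response) text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): each response is summarized by a feature
# vector, and a hidden "true" reward is a linear function of it.
dim = 8
true_w = rng.normal(size=dim)

def make_pairs(n):
    """Sample response pairs; label the higher-true-reward one as chosen."""
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    swap = (a @ true_w) < (b @ true_w)
    chosen = np.where(swap[:, None], b, a)
    rejected = np.where(swap[:, None], a, b)
    return chosen, rejected

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry training: maximize mean log sigma(r(chosen) - r(rejected))
# for a linear reward r(y) = w . y, by plain gradient ascent.
w = np.zeros(dim)
chosen, rejected = make_pairs(2000)
for _ in range(200):
    margin = (chosen - rejected) @ w
    grad = ((1.0 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w += 0.5 * grad

# Held-out reward accuracy: fraction of pairs where chosen outscores rejected.
c_test, r_test = make_pairs(500)
accuracy = ((c_test @ w) > (r_test @ w)).mean()
```

The learned weight vector plays the role of the reward model r_ϕ: once trained, the same scorer can label fresh completion pairs for iterative DPO/KTO rounds or supply scalar rewards for PPO.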

Theoretical Basis

The Bradley-Terry model defines the probability that response A is preferred over response B:

P(A ≻ B) = σ(r_ϕ(x, A) − r_ϕ(x, B))

Where r_ϕ is the reward model and σ is the sigmoid function.

The training loss is the binary cross-entropy:

ℒ_BT = −log σ(r_ϕ(x, y_w) − r_ϕ(x, y_l))

Where y_w is the preferred (chosen) response and y_l is the dispreferred (rejected) response.

The key metric for evaluating a reward model is reward accuracy: the fraction of held-out preference pairs where the model assigns a higher score to the chosen response.
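On held-out data this metric reduces to a per-pair score comparison (sketch; `reward_accuracy` is a hypothetical helper name):

```python
import numpy as np

def reward_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)
    return float((chosen > rejected).mean())

# Example: the model ranks the chosen response higher in 2 of 3 pairs.
# reward_accuracy([1.2, -0.3, 0.8], [0.5, 0.1, 0.2]) -> 2/3
```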

