Principle: Allenai Open Instruct Reward Model Training
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Preference Learning, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Reward model training is the process of fitting a parameterized scalar reward function to human preference data using the Bradley-Terry pairwise comparison model, enabling the reward model to score text completions in a manner consistent with human judgments.
Description
In RLHF pipelines, a reward model serves as a proxy for human preferences. Training this model requires a dataset of preference pairs: for a given prompt, a human annotator (or an AI judge) selects a "chosen" completion over a "rejected" completion. The reward model learns to assign higher scalar scores to chosen completions than to rejected ones.
The training objective is derived from the Bradley-Terry model of pairwise comparisons, a classical probabilistic framework from psychometrics. Under this model, the probability that completion $y_w$ is preferred over completion $y_l$ given prompt $x$ is:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

where $\sigma$ is the sigmoid function and $r_\theta$ is the reward model. The training loss is the negative log-likelihood of the observed preferences:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$
This is equivalent to binary cross-entropy where the reward model must predict which of two completions is preferred, but formulated through reward differences rather than absolute scores. This difference-based formulation has important properties:
- Translation invariance: Adding a constant to all rewards does not change the loss, so the model learns only the relative quality of completions.
- Calibrated probabilities: The sigmoid of the reward margin can be read as the model's predicted probability of preference; under the Bradley-Terry assumption, this probability is calibrated to the training distribution.
- Smooth gradients: The loss is smooth and convex in the reward difference, providing stable optimization dynamics.
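The loss and its translation invariance can be checked in a few lines of plain Python (a minimal sketch; `bt_loss` is an illustrative helper, not an Open Instruct function):

```python
import math

def bt_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry negative log-likelihood over a batch of preference pairs."""
    losses = [-math.log(1.0 / (1.0 + math.exp(-(c - r))))
              for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)

chosen = [1.2, 0.4, 2.0]
rejected = [0.3, 0.9, -0.5]

base = bt_loss(chosen, rejected)
# Translation invariance: shifting every reward by a constant
# leaves all reward differences, and hence the loss, unchanged.
shifted = bt_loss([c + 100.0 for c in chosen], [r + 100.0 for r in rejected])
print(abs(base - shifted) < 1e-9)  # True
```

Note also that a pair with zero margin contributes exactly $\ln 2$ to the loss, matching the random-guessing baseline.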
Key Training Considerations
Dropout disabling: Following Stiennon et al. (2020), dropout is disabled throughout the reward model during training. Since the reward model is used deterministically (not for sampling), stochastic dropout introduces unnecessary noise in reward predictions.
Gradient accumulation: Preference pair training benefits from larger effective batch sizes. Open Instruct supports gradient accumulation across multiple micro-batches to achieve larger batch sizes without increasing per-GPU memory requirements.
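The arithmetic behind gradient accumulation can be sketched in plain Python: scaling each micro-batch gradient by 1/num_micro_batches and summing reproduces the full-batch gradient exactly. This toy example uses a one-parameter reward model $r(x) = w \cdot x$ and the closed-form Bradley-Terry gradient; all names are illustrative, not Open Instruct APIs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_w(w, pairs):
    """Mean gradient of the BT loss w.r.t. w for r(x) = w * x,
    using d/dw[-log sigmoid(w*(xc - xr))] = -(1 - p) * (xc - xr)."""
    g = 0.0
    for xc, xr in pairs:
        p = sigmoid(w * (xc - xr))
        g += -(1.0 - p) * (xc - xr)
    return g / len(pairs)

pairs = [(1.0, 0.2), (0.5, 0.9), (2.0, -1.0), (0.1, 0.0)]
w = 0.3

full = grad_w(w, pairs)
# Accumulate over two micro-batches of size 2, scaling each by 1/2.
micro = [pairs[:2], pairs[2:]]
accum = sum(grad_w(w, mb) / len(micro) for mb in micro)
print(abs(full - accum) < 1e-9)  # True
```

In practice frameworks like Accelerate or DeepSpeed handle this scaling, but the equivalence above is why accumulation yields the same update as a larger batch.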
Distributed training: The training loop supports multi-GPU training via HuggingFace Accelerate and DeepSpeed, with metrics aggregated across all processes.
Usage
Use reward model training when:
- Building the reward component of an RLHF pipeline (e.g., for PPO or GRPO).
- You have a dataset of human preference pairs (chosen/rejected completions for given prompts).
- You need a learned reward function to replace or supplement rule-based or verifiable rewards.
- Fine-tuning an existing reward model on new preference data from a specific domain or task.
Theoretical Basis
Bradley-Terry Preference Model
The Bradley-Terry model posits that for items $i$ and $j$ with associated strengths $s_i$ and $s_j$:

$$P(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \sigma(s_i - s_j)$$
This can be derived from the assumption that each item's perceived quality is its strength plus independent Gumbel noise. Under this noise model, the difference in perceived qualities follows a logistic distribution, yielding the sigmoid probability.
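This derivation can be verified empirically: adding independent Gumbel noise to each strength and counting how often item $i$ wins recovers the sigmoid probability (a quick stand-alone simulation, not library code):

```python
import math
import random

random.seed(0)

def gumbel():
    # Standard Gumbel sample via inverse CDF: -log(-log(U)), U ~ Uniform(0, 1).
    return -math.log(-math.log(random.random()))

s_i, s_j = 1.0, 0.0
trials = 200_000
wins = sum(1 for _ in range(trials) if s_i + gumbel() > s_j + gumbel())

empirical = wins / trials
predicted = 1.0 / (1.0 + math.exp(-(s_i - s_j)))  # sigma(s_i - s_j)
print(round(empirical, 3), round(predicted, 3))   # both near 0.731
```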
Loss Function
Given a dataset of $N$ preference triples $(x_i, y_{w,i}, y_{l,i})$ (prompt, chosen, rejected), the training loss is:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma\big(r_\theta(x_i, y_{w,i}) - r_\theta(x_i, y_{l,i})\big)$$

The gradient with respect to the reward model parameters is:

$$\nabla_\theta \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} (1 - p_i)\big(\nabla_\theta r_\theta(x_i, y_{w,i}) - \nabla_\theta r_\theta(x_i, y_{l,i})\big)$$

where $p_i = \sigma\big(r_\theta(x_i, y_{w,i}) - r_\theta(x_i, y_{l,i})\big)$ is the model's predicted probability of the observed preference.
Note that the gradient weighting factor $(1 - p_i)$ is large when the model assigns similar rewards to chosen and rejected (the model is uncertain), and small when the model is already confident in the correct ranking. This provides a natural form of hard example mining.
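Numerically, the weighting behaviour looks like this: the gradient weight $1 - \sigma(\text{margin})$ shrinks as the reward margin on a pair grows (illustrative values only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient weight (1 - p) for several reward margins r_chosen - r_rejected.
# Uncertain pairs (margin near 0) get weight near 0.5; confidently
# correct pairs (large positive margin) get weight near 0.
for margin in [-2.0, 0.0, 2.0, 5.0]:
    weight = 1.0 - sigmoid(margin)
    print(f"margin={margin:+.1f}  weight={weight:.3f}")
```

A confidently *wrong* pair (large negative margin) gets weight near 1, so it dominates the batch gradient.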
Training Metrics
Key metrics monitored during training include:
- Accuracy: Fraction of preference pairs where $r_\theta(x, y_w) > r_\theta(x, y_l)$. Random performance is 50%.
- Loss: The Bradley-Terry negative log-likelihood. Perfect accuracy corresponds to loss approaching 0, while random guessing gives $\ln 2 \approx 0.693$.
- Chosen/Rejected rewards: Mean reward scores for chosen and rejected completions. The gap between these (the reward margin) should increase during training.
- Reward margin: $r_\theta(x, y_w) - r_\theta(x, y_l)$, which should be positive and growing.
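These metrics can all be computed from the batch reward scores (a hedged sketch; `rm_metrics` and its keys are illustrative names, not Open Instruct's):

```python
import math

def rm_metrics(chosen_rewards, rejected_rewards):
    """Accuracy, BT loss, mean scores, and mean margin for a batch of pairs.
    Uses -log sigmoid(m) = log(1 + exp(-m)) for the per-pair loss."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        "accuracy": sum(m > 0 for m in margins) / len(margins),
        "loss": sum(math.log(1 + math.exp(-m)) for m in margins) / len(margins),
        "chosen_mean": sum(chosen_rewards) / len(chosen_rewards),
        "rejected_mean": sum(rejected_rewards) / len(rejected_rewards),
        "margin": sum(margins) / len(margins),
    }

m = rm_metrics([1.5, 0.2, 2.0, 0.0], [0.5, 0.8, -1.0, 0.0])
print(m["accuracy"])  # 0.5: two of four margins are positive
```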
Pseudocode
for each epoch:
    for each batch of (chosen, rejected) pairs:
        1. Concatenate chosen and rejected sequences
        2. Forward pass through reward model to get scalar rewards
        3. Split rewards back into chosen_rewards and rejected_rewards
        4. Compute accuracy = mean(chosen_rewards > rejected_rewards)
        5. Compute loss = -mean(log_sigmoid(chosen_rewards - rejected_rewards))
        6. Backward pass and optimizer step
        7. Log metrics: accuracy, loss, chosen/rejected scores, reward margin
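The loop above can be made concrete with a toy end-to-end run: a one-parameter reward model $r(x) = w \cdot x$ trained by gradient descent using the closed-form gradient from the Theoretical Basis section. This is a pure-Python illustration on invented, linearly separable data, not the Open Instruct implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: each pair is (chosen_feature, rejected_feature);
# chosen features are always larger, so the data is separable.
pairs = [(1.0, 0.2), (0.8, -0.3), (2.0, 0.5), (0.3, -1.0), (1.5, 1.0)]

w = 0.0   # single reward-model parameter, r(x) = w * x
lr = 0.5

for epoch in range(200):
    grad, loss, correct = 0.0, 0.0, 0
    for xc, xr in pairs:
        margin = w * (xc - xr)
        p = sigmoid(margin)
        loss += -math.log(p)                 # Bradley-Terry NLL for this pair
        correct += margin > 0                # pair ranked correctly?
        grad += -(1.0 - p) * (xc - xr)       # d/dw of -log sigmoid(margin)
    w -= lr * grad / len(pairs)              # gradient descent step

final_loss = loss / len(pairs)
accuracy = correct / len(pairs)
print(f"w={w:.2f}  loss={final_loss:.3f}  accuracy={accuracy:.2f}")
```

On this separable toy data the loss falls well below the $\ln 2$ random baseline and accuracy reaches 1.0; a real run replaces the scalar model with a transformer and the manual gradient with autograd.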