Principle: Allenai Open Instruct Reward Model Training
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Preference Learning, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Reward model training is the process of fitting a parameterized scalar reward function to human preference data using the Bradley-Terry pairwise comparison model, enabling the reward model to score text completions in a manner consistent with human judgments.
Description
In RLHF pipelines, a reward model serves as a proxy for human preferences. Training this model requires a dataset of preference pairs: for a given prompt, a human annotator (or an AI judge) selects a "chosen" completion over a "rejected" completion. The reward model learns to assign higher scalar scores to chosen completions than to rejected ones.
The training objective is derived from the Bradley-Terry model of pairwise comparisons, a classical probabilistic framework from psychometrics. Under this model, the probability that completion $y_w$ is preferred over completion $y_l$ given prompt $x$ is:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

where $\sigma$ is the sigmoid function and $r_\theta$ is the reward model. The training loss is the negative log-likelihood of the observed preferences:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$
This is equivalent to binary cross-entropy where the reward model must predict which of two completions is preferred, but formulated through reward differences rather than absolute scores. This difference-based formulation has important properties:
- Translation invariance: Adding a constant to all rewards does not change the loss, so the model learns only the relative quality of completions.
- Calibrated probabilities: The sigmoid of the reward margin can be read as the model's predicted probability of preference; under the Bradley-Terry assumption, this probability is calibrated to the training distribution.
- Smooth gradients: The loss is smooth and convex in the reward difference, providing stable optimization dynamics.
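The loss and its translation invariance can be checked in a few lines of plain Python (a minimal sketch; `bt_loss` is an illustrative helper, not an Open Instruct function):

```python
import math

def bt_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry negative log-likelihood over a batch of preference pairs."""
    losses = [-math.log(1.0 / (1.0 + math.exp(-(c - r))))
              for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)

chosen = [1.2, 0.4, 2.0]
rejected = [0.3, 0.9, -0.5]

base = bt_loss(chosen, rejected)
# Translation invariance: shifting every reward by a constant
# leaves all reward differences, and hence the loss, unchanged.
shifted = bt_loss([c + 100.0 for c in chosen], [r + 100.0 for r in rejected])
print(abs(base - shifted) < 1e-9)  # True
```

Note also that a pair with zero margin contributes exactly $\ln 2$ to the loss, matching the random-guessing baseline.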
Key Training Considerations
Dropout disabling: Following Stiennon et al. (2020), dropout is disabled throughout the reward model during training. Since the reward model is used deterministically (not for sampling), stochastic dropout introduces unnecessary noise in reward predictions.
Gradient accumulation: Preference pair training benefits from larger effective batch sizes. Open Instruct supports gradient accumulation across multiple micro-batches to achieve larger batch sizes without increasing per-GPU memory requirements.
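The arithmetic behind gradient accumulation can be sketched in plain Python: scaling each micro-batch gradient by 1/num_micro_batches and summing reproduces the full-batch gradient exactly. This toy example uses a one-parameter reward model $r(x) = w \cdot x$ and the closed-form Bradley-Terry gradient; all names are illustrative, not Open Instruct APIs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_w(w, pairs):
    """Mean gradient of the BT loss w.r.t. w for r(x) = w * x,
    using d/dw[-log sigmoid(w*(xc - xr))] = -(1 - p) * (xc - xr)."""
    g = 0.0
    for xc, xr in pairs:
        p = sigmoid(w * (xc - xr))
        g += -(1.0 - p) * (xc - xr)
    return g / len(pairs)

pairs = [(1.0, 0.2), (0.5, 0.9), (2.0, -1.0), (0.1, 0.0)]
w = 0.3

full = grad_w(w, pairs)
# Accumulate over two micro-batches of size 2, scaling each by 1/2.
micro = [pairs[:2], pairs[2:]]
accum = sum(grad_w(w, mb) / len(micro) for mb in micro)
print(abs(full - accum) < 1e-9)  # True
```

In practice frameworks like Accelerate or DeepSpeed handle this scaling, but the equivalence above is why accumulation yields the same update as a larger batch.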
Distributed training: The training loop supports multi-GPU training via HuggingFace Accelerate and DeepSpeed, with metrics aggregated across all processes.
Usage
Use reward model training when:
- Building the reward component of an RLHF pipeline (e.g., for PPO or GRPO).
- You have a dataset of human preference pairs (chosen/rejected completions for given prompts).
- You need a learned reward function to replace or supplement rule-based or verifiable rewards.
- Fine-tuning an existing reward model on new preference data from a specific domain or task.
Theoretical Basis
Bradley-Terry Preference Model
The Bradley-Terry model posits that for items $i$ and $j$ with associated strengths $s_i$ and $s_j$:

$$P(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \sigma(s_i - s_j)$$
This can be derived from the assumption that each item's perceived quality is its strength plus independent Gumbel noise. Under this noise model, the difference in perceived qualities follows a logistic distribution, yielding the sigmoid probability.
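This derivation can be verified empirically: adding independent Gumbel noise to each strength and counting how often item $i$ wins recovers the sigmoid probability (a quick stand-alone simulation, not library code):

```python
import math
import random

random.seed(0)

def gumbel():
    # Standard Gumbel sample via inverse CDF: -log(-log(U)), U ~ Uniform(0, 1).
    return -math.log(-math.log(random.random()))

s_i, s_j = 1.0, 0.0
trials = 200_000
wins = sum(1 for _ in range(trials) if s_i + gumbel() > s_j + gumbel())

empirical = wins / trials
predicted = 1.0 / (1.0 + math.exp(-(s_i - s_j)))  # sigma(s_i - s_j)
print(round(empirical, 3), round(predicted, 3))   # both near 0.731
```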
Loss Function
Given a dataset of $N$ preference triples $(x_i, y_{w,i}, y_{l,i})$ (prompt, chosen, rejected), the training loss is:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma\big(r_\theta(x_i, y_{w,i}) - r_\theta(x_i, y_{l,i})\big)$$

The gradient with respect to the reward model parameters is:

$$\nabla_\theta \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} (1 - p_i)\big(\nabla_\theta r_\theta(x_i, y_{w,i}) - \nabla_\theta r_\theta(x_i, y_{l,i})\big)$$

where $p_i = \sigma\big(r_\theta(x_i, y_{w,i}) - r_\theta(x_i, y_{l,i})\big)$ is the model's predicted probability of the observed preference.
Note that the gradient weighting factor $(1 - p_i)$ is large when the model assigns similar rewards to chosen and rejected (the model is uncertain), and small when the model is already confident in the correct ranking. This provides a natural form of hard example mining.
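Numerically, the weighting behaviour looks like this: the gradient weight $1 - \sigma(\text{margin})$ shrinks as the reward margin on a pair grows (illustrative values only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient weight (1 - p) for several reward margins r_chosen - r_rejected.
# Uncertain pairs (margin near 0) get weight near 0.5; confidently
# correct pairs (large positive margin) get weight near 0.
for margin in [-2.0, 0.0, 2.0, 5.0]:
    weight = 1.0 - sigmoid(margin)
    print(f"margin={margin:+.1f}  weight={weight:.3f}")
```

A confidently *wrong* pair (large negative margin) gets weight near 1, so it dominates the batch gradient.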
Training Metrics
Key metrics monitored during training include:
- Accuracy: Fraction of preference pairs where $r_\theta(x, y_w) > r_\theta(x, y_l)$. Random performance is 50%.
- Loss: The Bradley-Terry negative log-likelihood. Perfect accuracy corresponds to loss approaching 0, while random guessing gives $\ln 2 \approx 0.693$.
- Chosen/Rejected rewards: Mean reward scores for chosen and rejected completions. The gap between these (the reward margin) should increase during training.
- Reward margin: $r_\theta(x, y_w) - r_\theta(x, y_l)$, which should be positive and growing.
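These metrics can all be computed from the batch reward scores (a hedged sketch; `rm_metrics` and its keys are illustrative names, not Open Instruct's):

```python
import math

def rm_metrics(chosen_rewards, rejected_rewards):
    """Accuracy, BT loss, mean scores, and mean margin for a batch of pairs.
    Uses -log sigmoid(m) = log(1 + exp(-m)) for the per-pair loss."""
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        "accuracy": sum(m > 0 for m in margins) / len(margins),
        "loss": sum(math.log(1 + math.exp(-m)) for m in margins) / len(margins),
        "chosen_mean": sum(chosen_rewards) / len(chosen_rewards),
        "rejected_mean": sum(rejected_rewards) / len(rejected_rewards),
        "margin": sum(margins) / len(margins),
    }

m = rm_metrics([1.5, 0.2, 2.0, 0.0], [0.5, 0.8, -1.0, 0.0])
print(m["accuracy"])  # 0.5: two of four margins are positive
```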
Pseudocode
for each epoch:
    for each batch of (chosen, rejected) pairs:
        1. Concatenate chosen and rejected sequences
        2. Forward pass through reward model to get scalar rewards
        3. Split rewards back into chosen_rewards and rejected_rewards
        4. Compute accuracy = mean(chosen_rewards > rejected_rewards)
        5. Compute loss = -mean(log_sigmoid(chosen_rewards - rejected_rewards))
        6. Backward pass and optimizer step
        7. Log metrics: accuracy, loss, chosen/rejected scores, reward margin
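The loop above can be made concrete with a toy end-to-end run: a one-parameter reward model $r(x) = w \cdot x$ trained by gradient descent using the closed-form gradient from the Theoretical Basis section. This is a pure-Python illustration on invented, linearly separable data, not the Open Instruct implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: each pair is (chosen_feature, rejected_feature);
# chosen features are always larger, so the data is separable.
pairs = [(1.0, 0.2), (0.8, -0.3), (2.0, 0.5), (0.3, -1.0), (1.5, 1.0)]

w = 0.0   # single reward-model parameter, r(x) = w * x
lr = 0.5

for epoch in range(200):
    grad, loss, correct = 0.0, 0.0, 0
    for xc, xr in pairs:
        margin = w * (xc - xr)
        p = sigmoid(margin)
        loss += -math.log(p)                 # Bradley-Terry NLL for this pair
        correct += margin > 0                # pair ranked correctly?
        grad += -(1.0 - p) * (xc - xr)       # d/dw of -log sigmoid(margin)
    w -= lr * grad / len(pairs)              # gradient descent step

final_loss = loss / len(pairs)
accuracy = correct / len(pairs)
print(f"w={w:.2f}  loss={final_loss:.3f}  accuracy={accuracy:.2f}")
```

On this separable toy data the loss falls well below the $\ln 2$ random baseline and accuracy reaches 1.0; a real run replaces the scalar model with a transformer and the manual gradient with autograd.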