Principle: CarperAI Trlx Reward Model Architecture
| Knowledge Sources | |
|---|---|
| Domains | Reward_Modeling, NLP, Model_Architecture |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
An architectural principle for building reward models that learn to predict human preferences from pairwise comparison data.
Description
A reward model maps text sequences to scalar reward values, providing the training signal for RLHF. The standard architecture takes a pre-trained language model (e.g., GPT-J), removes the language model head, and adds a linear value head that projects the final hidden state to a scalar. The model is trained on pairwise comparison data where humans have indicated which of two completions they prefer.
The training objective is the Bradley-Terry pairwise ranking loss: for each comparison pair, the model should assign a higher reward to the preferred (chosen) completion than to the rejected one. The loss is the negative log-sigmoid of the reward difference, which is a smooth approximation of the 0-1 ranking loss.
Usage
Use a reward model when setting up the reward-learning stage of an RLHF pipeline. A reward model is appropriate when human preferences have been collected as pairwise comparisons, you want a differentiable proxy of human judgment, and you need a reward signal for PPO training. The trained reward model is then wrapped in a reward function that scores samples during PPO training.
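As a minimal sketch of that wrapping step (the function and argument names here are illustrative, not the trlx API; `reward_model` maps token ids to a scalar and `tokenize` maps text to token ids):

```python
def make_reward_fn(reward_model, tokenize):
    """Wrap a trained reward model as a scoring function for PPO."""
    def reward_fn(samples):
        # Score each sampled completion with the frozen reward model
        return [reward_model(tokenize(s)) for s in samples]
    return reward_fn
```

During PPO training, `reward_fn` is called on batches of generated samples, and its scalar outputs drive the policy update.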
Theoretical Basis
The Bradley-Terry model of pairwise preferences:

$$P(y_c \succ y_r \mid x) = \sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)$$

where $r_\theta$ is the reward model and $\sigma$ is the sigmoid function.

The pairwise ranking loss:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_c, y_r)}\left[\log \sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right]$$
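A minimal numeric sketch of this loss in pure Python (illustrative only; a real implementation would be vectorized over a batch):

```python
import math

def pairwise_ranking_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r_chosen - r_rejected) over comparison pairs."""
    total = 0.0
    for c, r in zip(chosen_rewards, rejected_rewards):
        total += -math.log(1.0 / (1.0 + math.exp(-(c - r))))
    return total / len(chosen_rewards)
```

When chosen and rejected completions receive equal rewards, the loss is log 2 ≈ 0.693; it shrinks toward zero as the reward margin in favor of the chosen completion grows.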
Architecture:
```python
# Abstract reward model structure (not a real implementation)
class RewardModel(nn.Module):
    def __init__(self, pretrained_lm, hidden_size):
        super().__init__()
        self.transformer = pretrained_lm.transformer  # Reuse backbone
        self.v_head = nn.Linear(hidden_size, 1)       # Scalar projection

    def forward(self, input_ids):
        hidden = self.transformer(input_ids)   # [batch, seq, hidden]
        rewards = self.v_head(hidden)          # Per-token rewards
        return rewards[last_non_pad_token]     # Return end-of-sequence reward
```
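The end-of-sequence indexing can be sketched in plain Python (assuming right-padded sequences; `pad_token_id` is an illustrative default, not a fixed value):

```python
def last_non_pad_index(input_ids, pad_token_id=0):
    """Index of the final non-padding token in a right-padded sequence."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    return 0
```

Taking the reward at this position ensures that, in a padded batch, the scalar reward is read from the last real token rather than from a pad position.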
Training strategy:
- Initialize from SFT checkpoint (same domain)
- Freeze early layers (first 70%) for stability
- Train on pairwise comparisons with ranking loss
- Evaluate via comparison accuracy on held-out pairs
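Two of these steps, layer freezing and held-out evaluation, can be sketched in plain Python (helper names are illustrative, not part of trlx):

```python
def split_frozen(layers, fraction=0.7):
    """Partition layers into (frozen, trainable); the first `fraction` are frozen."""
    n_frozen = round(len(layers) * fraction)
    return layers[:n_frozen], layers[n_frozen:]

def comparison_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of held-out pairs where the chosen completion scores higher."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)
```

Freezing roughly the first 70% of the backbone keeps low-level representations stable while the value head and upper layers adapt; comparison accuracy on held-out pairs is the standard sanity check that the learned reward agrees with human preferences.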