Principle:Huggingface Trl Reward Model Training
| Property | Value |
|---|---|
| Principle Name | Reward Model Training |
| Technology | Huggingface TRL |
| Category | Training |
| Workflow | Reward Model Training |
| Paper | InstructGPT (https://arxiv.org/abs/2203.02155) |
| Implementation | Implementation:Huggingface_Trl_RewardTrainer_Init_Train |
Overview
Description
The RewardTrainer is the core component for training reward models in the RLHF pipeline. It implements the Bradley-Terry pairwise preference loss, which trains a model to assign higher scalar rewards to human-preferred responses. The trainer extends the standard Huggingface Trainer (via BaseTrainer) with reward-specific loss computation, metric tracking, and model initialization logic.
The training objective minimizes the negative log-likelihood that the model correctly ranks chosen responses above rejected responses:
L = -log(sigmoid(r_chosen - r_rejected))
Minimizing this loss trains the reward model to assign higher scores to chosen responses, with the size of the score gap reflecting the strength of preference.
Usage
The RewardTrainer is instantiated with a model (or model path), optional RewardConfig, training dataset, and optional PEFT configuration. It supports both full fine-tuning and parameter-efficient training. Training is launched by calling the train() method, which follows the standard Huggingface Trainer loop.
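A minimal sketch of this flow, assuming a small sequence-classification backbone and the public trl-lib/ultrafeedback_binarized preference dataset. Keyword names vary across TRL releases (for example, processing_class was previously tokenizer), so treat this as illustrative rather than canonical:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed base model, for illustration
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "chosen" and "rejected" columns; recent TRL
# versions tokenize these pairs internally.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = RewardConfig(output_dir="reward-model", per_device_train_batch_size=8)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```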
Theoretical Basis
Bradley-Terry Model
The Bradley-Terry model is a probabilistic framework for pairwise comparisons. Given two items with "skill" parameters (rewards) r_1 and r_2, the probability that item 1 is preferred over item 2 is:
P(1 > 2) = exp(r_1) / (exp(r_1) + exp(r_2)) = sigmoid(r_1 - r_2)
The negative log-likelihood of the observed preferences yields the training loss:
L = -log(sigmoid(r_chosen - r_rejected))
This loss has several desirable properties:
- Convexity: The loss is convex in the reward difference, ensuring stable optimization.
- Shift invariance: Only the difference between rewards matters, not their absolute values; adding a constant to every reward leaves the loss unchanged.
- Gradient signal: The gradient magnitude is proportional to 1 - P(chosen > rejected), so incorrectly ranked pairs receive stronger updates.
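In PyTorch terms the loss is a one-liner over batched reward tensors. The sketch below mirrors the form of the loss rather than TRL's exact internals; logsigmoid is used in place of log(sigmoid(...)) for numerical stability:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that each chosen response outranks its rejected pair.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One correctly ranked pair (gap +2.0) and one flipped pair (gap -1.0):
r_c = torch.tensor([2.0, 0.5])
r_r = torch.tensor([0.0, 1.5])
print(bradley_terry_loss(r_c, r_r))  # ~0.72; the flipped pair dominates the loss
```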
Margin-Based Loss
When margin annotations are available in the dataset, the loss is modified to account for the degree of preference:
L = -log(sigmoid(r_chosen - r_rejected - margin))
The margin shifts the decision boundary, requiring the reward model to produce larger reward differences for strongly preferred responses.
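As a sketch, the margin simply enters the sigmoid argument, assuming the dataset supplies a per-pair margin tensor:

```python
import torch.nn.functional as F

def margin_loss(r_chosen, r_rejected, margin):
    # A larger margin demands a larger reward gap before the loss becomes small.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```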
Center Rewards Regularization
An optional regularization term encourages mean-zero reward outputs:
L_total = L_preference + coefficient * mean((r_chosen + r_rejected)^2)
This prevents reward drift where all rewards shift to large positive or negative values, which can cause instability in downstream PPO training.
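A sketch of the combined objective follows; in TRL this term is controlled by the center_rewards_coefficient field of RewardConfig, and the coefficient value here is illustrative:

```python
import torch

def centered_total_loss(preference_loss, r_chosen, r_rejected, coefficient=0.01):
    # Penalize the squared sum of paired rewards, pulling their mean toward zero
    # without touching the difference that the preference loss depends on.
    return preference_loss + coefficient * torch.mean((r_chosen + r_rejected) ** 2)
```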
Training Metrics
The trainer tracks the following metrics during training and evaluation:
| Metric | Description |
|---|---|
| accuracy | Fraction of pairs where r_chosen > r_rejected |
| margin | Mean difference (r_chosen - r_rejected) |
| min_reward | Minimum reward value in the batch |
| mean_reward | Mean reward value across all responses |
| max_reward | Maximum reward value in the batch |
These metrics provide diagnostic insight into reward model behavior. An accuracy near 1.0 with a healthy margin indicates good preference discrimination.
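These metrics can be reproduced from the two reward tensors of a batch; this sketch uses illustrative names rather than TRL's internal logging keys:

```python
import torch

def reward_metrics(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> dict:
    all_rewards = torch.cat([r_chosen, r_rejected])
    return {
        "accuracy": (r_chosen > r_rejected).float().mean().item(),
        "margin": (r_chosen - r_rejected).mean().item(),
        "min_reward": all_rewards.min().item(),
        "mean_reward": all_rewards.mean().item(),
        "max_reward": all_rewards.max().item(),
    }
```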