Principle: Microsoft DeepSpeedExamples Reward Model Training
Overview
A technique for training a scalar reward model from human preference comparisons to guide reinforcement learning fine-tuning of language models.
Description
The reward model learns to assign scalar scores to model outputs by training on pairwise comparisons (chosen vs rejected). It uses a ranking loss: the model learns that the chosen response should score higher than the rejected one.
The architecture typically adds a linear value head (hidden_size -> 1) on top of a pre-trained language model. During the forward pass, the base transformer produces hidden states, and the value head projects the final hidden state to a single scalar reward value.
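The architecture described above can be sketched in PyTorch. This is a minimal illustration, not the DeepSpeed-Chat implementation: the class and attribute names (`RewardModel`, `v_head`, `ToyBackbone`) are chosen for clarity, and the toy backbone stands in for a real pre-trained transformer.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """A base transformer plus a linear value head (hidden_size -> 1).

    `base_model` is assumed to map (input_ids, attention_mask) to
    hidden states of shape (batch, seq_len, hidden_size).
    """
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.transformer = base_model
        self.v_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.transformer(input_ids, attention_mask)  # (B, T, H)
        rewards = self.v_head(hidden_states).squeeze(-1)             # (B, T) per-token scores
        return rewards

# Toy stand-in for a pre-trained language model backbone (illustration only).
class ToyBackbone(nn.Module):
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
    def forward(self, input_ids, attention_mask=None):
        return self.embed(input_ids)

model = RewardModel(ToyBackbone(), hidden_size=16)
scores = model(torch.randint(0, 100, (4, 10)))  # one scalar reward per token
```

In practice the scalar reward for a sequence is read off at a single position (typically the last non-padded token), while the per-token scores feed the pairwise loss described next.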
The training procedure works as follows:
- A batch of inputs is constructed by concatenating chosen and rejected sequences along the batch dimension.
- The full batch is passed through the transformer and the value head to obtain per-token reward scores.
- The batch is split in half: the first half corresponds to chosen responses, the second half to rejected responses.
- For each pair, the model identifies where the chosen and rejected sequences diverge (i.e., where tokens first differ).
- The ranking loss is computed only over the divergent region (up to the padding boundary), ensuring the model focuses on the meaningful differences between chosen and rejected outputs.
- The per-pair losses are averaged across the batch.
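The procedure above can be sketched as a simplified loss function. This is a hedged reconstruction, not the exact DeepSpeed-Chat code: the function name `pairwise_ranking_loss` and the edge-case handling are assumptions, and the real implementation handles additional cases (e.g., identical sequences, no padding).

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(rewards, input_ids, pad_token_id):
    """rewards: (2B, T) per-token scores; rows [0, B) are chosen, [B, 2B) rejected.
    input_ids: (2B, T) token ids, stacked the same way."""
    bsz = rewards.size(0) // 2
    chosen_ids, rejected_ids = input_ids[:bsz], input_ids[bsz:]
    chosen_r, rejected_r = rewards[:bsz], rewards[bsz:]
    loss = 0.0
    for i in range(bsz):
        # Find the first position where chosen and rejected tokens differ.
        diff = (chosen_ids[i] != rejected_ids[i]).nonzero()
        if diff.numel() == 0:
            continue  # identical pair: nothing to rank
        start = diff[0].item()
        # End of the non-padded region (longer of the two sequences).
        c_end = (chosen_ids[i] != pad_token_id).nonzero().max().item() + 1
        r_end = (rejected_ids[i] != pad_token_id).nonzero().max().item() + 1
        end = max(c_end, r_end)
        # Ranking loss over the divergent region only.
        c_trunc = chosen_r[i, start:end]
        r_trunc = rejected_r[i, start:end]
        loss += -F.logsigmoid(c_trunc - r_trunc).mean()
    return loss / bsz

# Tiny worked example: one pair, pad id 0, sequences diverge at position 2.
input_ids = torch.tensor([[1, 2, 3, 0],   # chosen
                          [1, 2, 4, 0]])  # rejected
rewards = torch.tensor([[0., 0., 2., 1.],
                        [0., 0., -1., 0.]])
loss = pairwise_ranking_loss(rewards, input_ids, pad_token_id=0)
```

In the example, only position 2 lies in the divergent, non-padded region, so the loss reduces to -logsigmoid(2 - (-1)), a small positive value because the chosen response already scores higher.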
This pairwise training approach allows the reward model to learn a scoring function from relative human judgments alone, without requiring absolute ratings.
Usage
Use as Step 2 of the RLHF pipeline when you have human preference data (chosen/rejected pairs). The trained reward model provides the reward signal for PPO training in Step 3.
Typical workflow:
- Fine-tune a base language model with supervised learning (Step 1: SFT).
- Train a reward model on human preference data (Step 2: Reward Model Training).
- Use the reward model to provide scalar rewards during PPO-based RLHF fine-tuning (Step 3: RLHF with PPO).
The reward model checkpoint produced in Step 2 is loaded by both the critic and reward components of the RLHF engine in Step 3.
Theoretical Basis
The reward model training objective is grounded in the Bradley-Terry model for pairwise comparisons. The ranking loss is defined as:
L = -log(sigma(r(chosen) - r(rejected)))
where:
- r is the reward function (the scalar output of the value head)
- sigma is the sigmoid function
- chosen is the human-preferred response
- rejected is the non-preferred response
This loss encourages the reward model to assign a higher scalar score to the chosen response than to the rejected response. Minimizing this loss over the dataset of human comparisons produces a reward function that reflects aggregate human preferences.
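A short numeric check of the loss makes its behavior concrete (the reward values 1.5 and 0.5 here are arbitrary illustrations):

```python
import math

def ranking_loss(r_chosen, r_rejected):
    # L = -log(sigma(r_chosen - r_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

loss = ranking_loss(1.5, 0.5)   # margin of 1.0 in the right direction
```

When the chosen response scores higher, the margin is positive and the loss is small (about 0.313 for a margin of 1.0); if the rejected response scored higher instead, the loss would grow roughly linearly with the violated margin.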
In practice, the loss is computed over the divergent token positions only (from the first differing token to the end of the non-padded region), as described in the DeepSpeed-Chat implementation:
loss += -torch.nn.functional.logsigmoid(c_truncated_reward - r_truncated_reward).mean()
This is equivalent to the Bradley-Terry formulation applied at the token level and averaged across the divergent region.