Principle: Microsoft DeepSpeedExamples Reward Model Training
Overview
A technique for training a scalar reward model from human preference comparisons to guide reinforcement learning fine-tuning of language models.
Description
The reward model learns to assign scalar scores to model outputs by training on pairwise comparisons (chosen vs rejected). It uses a ranking loss: the model learns that the chosen response should score higher than the rejected one.
The architecture typically adds a linear value head (hidden_size -> 1) on top of a pre-trained language model. During the forward pass, the base transformer produces hidden states, and the value head projects the final hidden state to a single scalar reward value.
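The architecture described above can be sketched in PyTorch. This is a minimal illustration, not the DeepSpeed-Chat implementation: the class and attribute names (`RewardModel`, `v_head`, `ToyBackbone`) are chosen for clarity, and the toy backbone stands in for a real pre-trained transformer.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """A base transformer plus a linear value head (hidden_size -> 1).

    `base_model` is assumed to map (input_ids, attention_mask) to
    hidden states of shape (batch, seq_len, hidden_size).
    """
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.transformer = base_model
        self.v_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.transformer(input_ids, attention_mask)  # (B, T, H)
        rewards = self.v_head(hidden_states).squeeze(-1)             # (B, T) per-token scores
        return rewards

# Toy stand-in for a pre-trained language model backbone (illustration only).
class ToyBackbone(nn.Module):
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
    def forward(self, input_ids, attention_mask=None):
        return self.embed(input_ids)

model = RewardModel(ToyBackbone(), hidden_size=16)
scores = model(torch.randint(0, 100, (4, 10)))  # one scalar reward per token
```

In practice the scalar reward for a sequence is read off at a single position (typically the last non-padded token), while the per-token scores feed the pairwise loss described next.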
The training procedure works as follows:
- A batch of inputs is constructed by concatenating chosen and rejected sequences along the batch dimension.
- The full batch is passed through the transformer and the value head to obtain per-token reward scores.
- The batch is split in half: the first half corresponds to chosen responses, the second half to rejected responses.
- For each pair, the model identifies where the chosen and rejected sequences diverge (i.e., where tokens first differ).
- The ranking loss is computed only over the divergent region (up to the padding boundary), ensuring the model focuses on the meaningful differences between chosen and rejected outputs.
- The per-pair losses are averaged across the batch.
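The procedure above can be sketched as a simplified loss function. This is a hedged reconstruction, not the exact DeepSpeed-Chat code: the function name `pairwise_ranking_loss` and the edge-case handling are assumptions, and the real implementation handles additional cases (e.g., identical sequences, no padding).

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(rewards, input_ids, pad_token_id):
    """rewards: (2B, T) per-token scores; rows [0, B) are chosen, [B, 2B) rejected.
    input_ids: (2B, T) token ids, stacked the same way."""
    bsz = rewards.size(0) // 2
    chosen_ids, rejected_ids = input_ids[:bsz], input_ids[bsz:]
    chosen_r, rejected_r = rewards[:bsz], rewards[bsz:]
    loss = 0.0
    for i in range(bsz):
        # Find the first position where chosen and rejected tokens differ.
        diff = (chosen_ids[i] != rejected_ids[i]).nonzero()
        if diff.numel() == 0:
            continue  # identical pair: nothing to rank
        start = diff[0].item()
        # End of the non-padded region (longer of the two sequences).
        c_end = (chosen_ids[i] != pad_token_id).nonzero().max().item() + 1
        r_end = (rejected_ids[i] != pad_token_id).nonzero().max().item() + 1
        end = max(c_end, r_end)
        # Ranking loss over the divergent region only.
        c_trunc = chosen_r[i, start:end]
        r_trunc = rejected_r[i, start:end]
        loss += -F.logsigmoid(c_trunc - r_trunc).mean()
    return loss / bsz

# Tiny worked example: one pair, pad id 0, sequences diverge at position 2.
input_ids = torch.tensor([[1, 2, 3, 0],   # chosen
                          [1, 2, 4, 0]])  # rejected
rewards = torch.tensor([[0., 0., 2., 1.],
                        [0., 0., -1., 0.]])
loss = pairwise_ranking_loss(rewards, input_ids, pad_token_id=0)
```

In the example, only position 2 lies in the divergent, non-padded region, so the loss reduces to -logsigmoid(2 - (-1)), a small positive value because the chosen response already scores higher.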
This pairwise training approach allows the reward model to learn a scoring function from relative human judgments alone, without requiring absolute ratings.
Usage
Use as Step 2 of the RLHF pipeline when you have human preference data (chosen/rejected pairs). The trained reward model provides the reward signal for PPO training in Step 3.
Typical workflow:
- Fine-tune a base language model with supervised learning (Step 1: SFT).
- Train a reward model on human preference data (Step 2: Reward Model Training).
- Use the reward model to provide scalar rewards during PPO-based RLHF fine-tuning (Step 3: RLHF with PPO).
The reward model checkpoint produced in Step 2 is loaded by both the critic and reward components of the RLHF engine in Step 3.
Theoretical Basis
The reward model training objective is grounded in the Bradley-Terry model for pairwise comparisons. The ranking loss is defined as:
L = -log(sigma(r(chosen) - r(rejected)))
where:
- r is the reward function (the scalar output of the value head)
- sigma is the sigmoid function
- chosen is the human-preferred response
- rejected is the non-preferred response
This loss encourages the reward model to assign a higher scalar score to the chosen response than to the rejected response. Minimizing this loss over the dataset of human comparisons produces a reward function that reflects aggregate human preferences.
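A short numeric check of the loss makes its behavior concrete (the reward values 1.5 and 0.5 here are arbitrary illustrations):

```python
import math

def ranking_loss(r_chosen, r_rejected):
    # L = -log(sigma(r_chosen - r_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

loss = ranking_loss(1.5, 0.5)   # margin of 1.0 in the right direction
```

When the chosen response scores higher, the margin is positive and the loss is small (about 0.313 for a margin of 1.0); if the rejected response scored higher instead, the loss would grow roughly linearly with the violated margin.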
In practice, the loss is computed over the divergent token positions only (from the first differing token to the end of the non-padded region), as described in the DeepSpeed-Chat implementation:
loss += -torch.nn.functional.logsigmoid(c_truncated_reward - r_truncated_reward).mean()
This is equivalent to the Bradley-Terry formulation applied at the token level and averaged across the divergent region.