Principle:Deepspeedai DeepSpeed Reward Model Training

Overview

Reward model training teaches a model to score language model outputs using human preference data. These scores serve as the optimization signal for reinforcement learning.

Description

The reward model is a language model with a scalar output head that learns to predict human preferences between pairs of responses. It is initialized from the SFT checkpoint and fine-tuned on comparison data consisting of preferred and rejected responses. The reward model assigns scalar scores to generated text, which the PPO algorithm uses as reward signals during the RLHF phase. Training uses standard DeepSpeed distributed training with ZeRO optimization.

In the DeepSpeed-Chat framework, reward model training is referred to as Step 2. The reward model architecture takes the base SFT model and replaces or augments its language modeling head with a linear projection that produces a single scalar value. Given a prompt and a response, the reward model outputs a score indicating how well the response aligns with human preferences. During training, the model processes pairs of responses (one preferred, one rejected) for the same prompt and learns to assign higher scores to the preferred responses.
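The scalar head described above can be illustrated with a small, dependency-free sketch. This is not the DeepSpeed-Chat implementation: the names `scalar_head` and `sequence_reward` are hypothetical, the hidden states are toy lists rather than transformer outputs, and reading the reward at the last non-padding token is an assumed convention that the text above does not specify.

```python
from typing import List


def scalar_head(hidden_states: List[List[float]],
                weight: List[float],
                bias: float = 0.0) -> List[float]:
    """Linear projection from each token's hidden vector to one scalar."""
    return [sum(h_i * w_i for h_i, w_i in zip(h, weight)) + bias
            for h in hidden_states]


def sequence_reward(hidden_states: List[List[float]],
                    attention_mask: List[int],
                    weight: List[float],
                    bias: float = 0.0) -> float:
    """Sequence-level reward, read at the last non-padding token.

    Assumption: the position used for the reward is a convention chosen
    for this sketch, not something stated in the source text.
    """
    per_token = scalar_head(hidden_states, weight, bias)
    last_real = max(i for i, m in enumerate(attention_mask) if m == 1)
    return per_token[last_real]


# Toy input: 3 token positions, hidden size 2; the third position is padding.
hidden = [[1.0, 0.0], [0.5, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]
w = [0.2, -0.1]

reward = sequence_reward(hidden, mask, w)
```

In a real model the hidden states would come from the SFT-initialized transformer, and the projection weights would be learned during reward model training.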

Like SFT, reward model training involves no text generation, so it uses the standard DeepSpeedEngine rather than the Hybrid Engine. The training pipeline handles comparison data in which each sample contains a prompt, a chosen response, and a rejected response. The model is trained to widen the margin between the score of the chosen response and the score of the rejected response.
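As a rough illustration of the comparison data format, the sketch below models one sample and builds the two sequences that would be scored. The `ComparisonSample` class, the `to_scored_pair` helper, and the example strings are hypothetical and chosen only for illustration. They do not reflect the actual DeepSpeed-Chat dataset classes or prompt template.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ComparisonSample:
    """One preference comparison: a prompt plus two candidate responses."""
    prompt: str
    chosen: str    # response preferred by the human annotator
    rejected: str  # response the annotator ranked lower


def to_scored_pair(sample: ComparisonSample) -> Tuple[str, str]:
    """Return the two full sequences the reward model would score.

    Both sequences share the same prompt, so any score difference
    comes from the response alone.
    """
    return sample.prompt + sample.chosen, sample.prompt + sample.rejected


# Made-up sample for illustration only.
sample = ComparisonSample(
    prompt="Human: What is 2 + 2? Assistant:",
    chosen=" 2 + 2 equals 4.",
    rejected=" I don't know.",
)
chosen_text, rejected_text = to_scored_pair(sample)
```

Each sample therefore yields two forward passes (or one batched pass over both sequences), and the two resulting scores feed the pairwise loss.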

Theoretical Basis

Reward model training is based on the Bradley-Terry model for pairwise comparisons. The probability that response y_w is preferred over response y_l given prompt x is modeled as:

P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l))

where r is the reward model, y_w is the preferred (winning) response, y_l is the rejected (losing) response, and sigma is the sigmoid function.
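A small worked example can make the preference probability concrete. The sketch below evaluates sigma(r(x, y_w) - r(x, y_l)) for a few made-up scores. The function names are hypothetical, and the scores are arbitrary numbers rather than outputs of a trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, branched so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w is preferred."""
    return sigmoid(r_w - r_l)


# Equal scores: the model is indifferent between the two responses.
p_equal = preference_probability(1.3, 1.3)

# Only the score difference matters: shifting both scores by the same
# constant leaves the probability unchanged.
p_shifted_a = preference_probability(2.0, 0.5)
p_shifted_b = preference_probability(12.0, 10.5)

# Swapping the roles of the two responses gives the complementary probability.
p_swapped = preference_probability(0.5, 2.0)
```

Here `p_equal` is 0.5, `p_shifted_a` and `p_shifted_b` agree, and `p_shifted_a + p_swapped` sums to 1, which follows directly from the form of the sigmoid.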

The training loss is the negative log-likelihood of the observed preferences:

L = -log(sigma(r(x, y_w) - r(x, y_l)))

This loss pushes the reward model to assign higher scores to preferred responses and lower scores to rejected ones. The sigmoid maps the score difference to a probability between 0 and 1, so the size of the gap, not just its sign, expresses how strongly one response is preferred. Because the loss depends only on the difference r(x, y_w) - r(x, y_l), the scores themselves are determined only up to an additive constant.
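The loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the DeepSpeed-Chat code: the function names are hypothetical, the loss is written in the equivalent softplus form log(1 + exp(-(r_w - r_l))) for numerical stability, and averaging over the batch is an assumed reduction.

```python
import math
from typing import Sequence


def pairwise_loss(r_w: float, r_l: float) -> float:
    """-log(sigmoid(r_w - r_l)), computed as softplus(-(r_w - r_l))."""
    d = r_w - r_l
    # softplus(-d) = log(1 + exp(-d)); branch so exp() never overflows.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def batch_loss(chosen_scores: Sequence[float],
               rejected_scores: Sequence[float]) -> float:
    """Mean pairwise loss over a batch (mean reduction is an assumption)."""
    losses = [pairwise_loss(w, l)
              for w, l in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)


loss_tied = pairwise_loss(0.0, 0.0)     # equal scores: log(2)
loss_correct = pairwise_loss(3.0, 0.0)  # preferred response scored higher
loss_wrong = pairwise_loss(0.0, 3.0)    # preferred response scored lower
mean_loss = batch_loss([3.0, 0.0], [0.0, 3.0])
```

The loss is small when the preferred response already scores higher, equals log(2) when the scores tie, and grows roughly linearly with the gap when the ranking is inverted, which is what drives the scores apart during training.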

References

InstructGPT: Training language models to follow instructions with human feedback — https://arxiv.org/abs/2203.02155
Deep reinforcement learning from human preferences — https://arxiv.org/abs/1706.03741

Related Pages

Implementation:Deepspeedai_DeepSpeed_Initialize_For_RM

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
