Principle: CarperAI Trlx Reward Model Architecture
| Knowledge Sources | |
|---|---|
| Domains | Reward_Modeling, NLP, Model_Architecture |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
An architectural principle for building reward models that learn to predict human preferences from pairwise comparison data.
Description
A reward model maps text sequences to scalar reward values, providing the training signal for RLHF. The standard architecture takes a pre-trained language model (e.g., GPT-J), removes the language model head, and adds a linear value head that projects the final hidden state to a scalar. The model is trained on pairwise comparison data where humans have indicated which of two completions they prefer.
The training objective is the Bradley-Terry pairwise ranking loss: for each comparison pair, the model should assign a higher reward to the preferred (chosen) completion than to the rejected one. The loss is the negative log-sigmoid of the reward difference, which is a smooth approximation of the 0-1 ranking loss.
Usage
Use a reward model when setting up the reward-learning stage of an RLHF pipeline. A reward model is appropriate when human preferences have been collected as pairwise comparisons, you want a differentiable proxy of human judgment, and you need a reward signal for PPO training. The trained reward model is then wrapped in a reward function that scores samples during PPO training.
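As a minimal sketch of that wrapping step (the function and argument names here are illustrative, not the trlx API; `reward_model` maps token ids to a scalar and `tokenize` maps text to token ids):

```python
def make_reward_fn(reward_model, tokenize):
    """Wrap a trained reward model as a scoring function for PPO."""
    def reward_fn(samples):
        # Score each sampled completion with the frozen reward model
        return [reward_model(tokenize(s)) for s in samples]
    return reward_fn
```

During PPO training, `reward_fn` is called on batches of generated samples, and its scalar outputs drive the policy update.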
Theoretical Basis
The Bradley-Terry model of pairwise preferences:

$$P(y_c \succ y_r \mid x) = \sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)$$

where $r_\theta$ is the reward model and $\sigma$ is the sigmoid function.

The pairwise ranking loss:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_c, y_r)}\left[\log \sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right]$$
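A minimal numeric sketch of this loss in pure Python (illustrative only; a real implementation would be vectorized over a batch):

```python
import math

def pairwise_ranking_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r_chosen - r_rejected) over comparison pairs."""
    total = 0.0
    for c, r in zip(chosen_rewards, rejected_rewards):
        total += -math.log(1.0 / (1.0 + math.exp(-(c - r))))
    return total / len(chosen_rewards)
```

When chosen and rejected completions receive equal rewards, the loss is log 2 ≈ 0.693; it shrinks toward zero as the reward margin in favor of the chosen completion grows.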
Architecture:
```python
# Abstract reward model structure (not a real implementation)
class RewardModel(nn.Module):
    def __init__(self, pretrained_lm, hidden_size):
        super().__init__()
        self.transformer = pretrained_lm.transformer  # Reuse backbone
        self.v_head = nn.Linear(hidden_size, 1)       # Scalar projection

    def forward(self, input_ids):
        hidden = self.transformer(input_ids)   # [batch, seq, hidden]
        rewards = self.v_head(hidden)          # Per-token rewards
        return rewards[last_non_pad_token]     # Return end-of-sequence reward
```
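The end-of-sequence indexing can be sketched in plain Python (assuming right-padded sequences; `pad_token_id` is an illustrative default, not a fixed value):

```python
def last_non_pad_index(input_ids, pad_token_id=0):
    """Index of the final non-padding token in a right-padded sequence."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    return 0
```

Taking the reward at this position ensures that, in a padded batch, the scalar reward is read from the last real token rather than from a pad position.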
Training strategy:
- Initialize from SFT checkpoint (same domain)
- Freeze early layers (first 70%) for stability
- Train on pairwise comparisons with ranking loss
- Evaluate via comparison accuracy on held-out pairs
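Two of these steps, layer freezing and held-out evaluation, can be sketched in plain Python (helper names are illustrative, not part of trlx):

```python
def split_frozen(layers, fraction=0.7):
    """Partition layers into (frozen, trainable); the first `fraction` are frozen."""
    n_frozen = round(len(layers) * fraction)
    return layers[:n_frozen], layers[n_frozen:]

def comparison_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of held-out pairs where the chosen completion scores higher."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)
```

Freezing roughly the first 70% of the backbone keeps low-level representations stable while the value head and upper layers adapt; comparison accuracy on held-out pairs is the standard sanity check that the learned reward agrees with human preferences.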