
Principle:CarperAI Trlx Reward Model Architecture

From Leeroopedia


Knowledge Sources
Domains Reward_Modeling, NLP, Model_Architecture
Last Updated 2026-02-07 16:00 GMT

Overview

An architectural principle for building reward models that learn to predict human preferences from pairwise comparison data.

Description

A reward model maps text sequences to scalar reward values, providing the training signal for RLHF. The standard architecture takes a pre-trained language model (e.g., GPT-J), removes the language model head, and adds a linear value head that projects the final hidden state to a scalar. The model is trained on pairwise comparison data where humans have indicated which of two completions they prefer.

The training objective is the Bradley-Terry pairwise ranking loss: for each comparison pair, the model should assign a higher reward to the preferred (chosen) completion than to the rejected one. The loss is the negative log-sigmoid of the reward difference, which is a smooth approximation of the 0-1 ranking loss.

Usage

Use a reward model when setting up the reward learning stage of an RLHF pipeline. A reward model is needed when: human preferences have been collected as pairwise comparisons, you want to learn a differentiable proxy of human judgment, and you need a reward signal for PPO training. The trained reward model is then used inside a reward function that wraps it for scoring during PPO training.
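As a sketch of how the trained model might be wrapped for PPO scoring (the names `make_reward_fn`, `toy_tokenizer`, and `toy_model` are illustrative stand-ins, not trlx's actual API):

```python
def make_reward_fn(reward_model, tokenizer):
    """Wrap a trained reward model as a batch scoring function for PPO.

    Hypothetical wrapper: `reward_model` is any callable mapping a list of
    token ids to a scalar reward; `tokenizer` maps text to token ids.
    """
    def reward_fn(samples):
        input_ids = [tokenizer(s) for s in samples]
        return [reward_model(ids) for ids in input_ids]
    return reward_fn

# Usage with stand-in components:
toy_tokenizer = lambda text: [ord(c) % 97 for c in text]
toy_model = lambda ids: float(sum(ids)) / (len(ids) or 1)  # mean of token ids
score = make_reward_fn(toy_model, toy_tokenizer)
rewards = score(["good answer", "bad"])  # one scalar reward per sample
```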

Theoretical Basis

The Bradley-Terry model of pairwise preferences:

P(y_w ≻ y_l | x) = σ(r_θ(x, y_w) − r_θ(x, y_l))

Where rθ is the reward model and σ is the sigmoid function.

The pairwise ranking loss:

L(θ) = −E_{(x, y_w, y_l) ∼ D} [ log σ(r_θ(x, y_w) − r_θ(x, y_l)) ]
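The loss above can be sketched numerically in a few lines (pure NumPy; the batching convention here is an assumption, not trlx's actual code):

```python
import numpy as np

def pairwise_ranking_loss(chosen_rewards, rejected_rewards):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch.

    Uses log(1 + exp(-d)) = logaddexp(0, -d) for numerical stability
    instead of computing the sigmoid directly.
    """
    diff = np.asarray(chosen_rewards, dtype=float) - np.asarray(rejected_rewards, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -diff)))

loss_equal = pairwise_ranking_loss([0.0], [0.0])   # log 2: model is indifferent
loss_right = pairwise_ranking_loss([10.0], [0.0])  # near 0: model agrees with labels
loss_wrong = pairwise_ranking_loss([0.0], [10.0])  # large: model disagrees
```

Note the loss only depends on the reward *difference*, so the reward model is identifiable only up to an additive constant.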

Architecture:

# Abstract reward model structure (illustrative, not a real implementation)
class RewardModel:
    def __init__(self, pretrained_lm, hidden_size):
        self.transformer = pretrained_lm.transformer  # Reuse backbone, drop LM head
        self.v_head = Linear(hidden_size, 1)          # Scalar projection

    def forward(self, input_ids):
        hidden = self.transformer(input_ids)          # (batch, seq, hidden)
        rewards = self.v_head(hidden)                 # Per-token scalar rewards
        return rewards[last_non_pad_token]            # Return end-of-sequence reward
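The `rewards[last_non_pad_token]` step above hides a detail worth spelling out: with right-padded batches, each sequence's reward must be read from a different position. A NumPy sketch of that indexing (function name is illustrative):

```python
import numpy as np

def end_of_sequence_rewards(rewards, input_ids, pad_token_id):
    """Pick each sequence's reward at its last non-pad token.

    rewards:   (batch, seq) per-token scalar rewards
    input_ids: (batch, seq) token ids, right-padded with pad_token_id
    """
    mask = input_ids != pad_token_id                     # True at real tokens
    # Index of the last True per row: scan the reversed mask for its first True.
    last_idx = mask.shape[1] - 1 - np.argmax(mask[:, ::-1], axis=1)
    return rewards[np.arange(rewards.shape[0]), last_idx]

rewards = np.array([[0.1, 0.5, 0.9], [0.2, 0.7, 0.3]])
ids = np.array([[5, 6, 0], [7, 8, 9]])                   # 0 = pad token
eos_rewards = end_of_sequence_rewards(rewards, ids, pad_token_id=0)
# Row 0 ends at position 1 -> 0.5; row 1 ends at position 2 -> 0.3
```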

Training strategy:

  • Initialize from SFT checkpoint (same domain)
  • Freeze early layers (first 70%) for stability
  • Train on pairwise comparisons with ranking loss
  • Evaluate via comparison accuracy on held-out pairs
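The freeze-early-layers heuristic above can be sketched in pure Python over a generic layer list (the dict-based layers are stand-ins for real modules, where freezing would mean setting `requires_grad = False` on their parameters):

```python
def freeze_early_layers(layers, fraction=0.7):
    """Mark the first `fraction` of layers as frozen (non-trainable)."""
    n_freeze = int(len(layers) * fraction)
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= n_freeze
    return n_freeze

layers = [{"name": f"block_{i}"} for i in range(10)]
frozen = freeze_early_layers(layers)  # freezes blocks 0 through 6
trainable = [layer["name"] for layer in layers if layer["trainable"]]
```

Freezing the backbone's early layers keeps the low-level representations from the pre-trained model intact and reduces the number of parameters the ranking loss can destabilize.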

Related Pages

Implemented By

Uses Heuristic
