Implementation: CarperAI trlx GPTRewardModel
| Knowledge Sources | Details |
|---|---|
| Domains | Reward_Modeling, NLP, Model_Architecture |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
Concrete tool from the trlx summarization example for training and running inference with a reward model on pairwise comparison data.
Description
GPTRewardModel is a PyTorch module that wraps a pre-trained causal language model (e.g., GPT-J-6B) with a linear value head for scalar reward prediction. It extracts the transformer backbone, discards the LM head, and adds a projection from hidden size to 1. The forward pass computes per-token rewards, then extracts end-of-sequence rewards for chosen and rejected completions. Training uses the Bradley-Terry pairwise ranking loss.
The model handles both training and inference modes. During training, input is a batch of concatenated (chosen, rejected) pairs. During inference, identical pairs are passed and only the chosen score is returned.
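The Bradley-Terry pairwise loss mentioned above reduces to -log(sigmoid(chosen_reward - rejected_reward)) per pair. A minimal torch-free sketch of the per-pair objective (illustration only, not the trlx implementation):

```python
import math

def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(chosen - rejected)).
    The loss shrinks as the chosen score exceeds the rejected score."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs a small loss; a reversed pair, a large one.
loss_good = bradley_terry_loss(2.0, -1.0)  # chosen scored higher
loss_bad = bradley_terry_loss(-1.0, 2.0)   # chosen scored lower
```

At equal scores the loss is log 2, i.e. the model is at chance; training pushes the margin positive.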
Usage
Use GPTRewardModel when training a reward model on pairwise comparison data (e.g., CarperAI/openai_summarize_comparisons). The trained model is then loaded into a reward function for PPO training.
Code Reference
Source Location
- Repository: trlx
- File: examples/summarize_rlhf/reward_model/reward_model.py
- Lines: L6-104
Signature
class GPTRewardModel(nn.Module):
    def __init__(self, model_path: str):
        """
        Initialize reward model from pre-trained causal LM.

        Args:
            model_path: HuggingFace model name or local path
                (e.g., "EleutherAI/gpt-j-6B" or SFT checkpoint path).
        """
        super().__init__()
        model = AutoModelForCausalLM.from_pretrained(model_path)
        self.config = model.config
        self.transformer = model.transformer
        self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)
        self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.PAD_ID = self.tokenizer(self.tokenizer.pad_token)["input_ids"][0]

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        mc_token_ids=None,
        labels=None,
        return_dict=False,
        output_attentions=False,
        output_hidden_states=False,
    ) -> Dict:
        """
        Forward pass for pairwise ranking.

        Input batch has shape [2*bs, seq_len] with chosen[:bs] and rejected[bs:].

        Returns:
            dict with "loss", "chosen_end_scores", "rejected_end_scores"
            (or just "chosen_end_scores" during inference).
        """
Import
from reward_model import GPTRewardModel
# or
from examples.summarize_rlhf.reward_model.reward_model import GPTRewardModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes (init) | Pre-trained model name/path (e.g., SFT checkpoint) |
| input_ids | torch.LongTensor | Yes (forward) | [2*bs, seq_len] concatenated chosen+rejected |
| attention_mask | torch.LongTensor | No | Padding attention mask |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | torch.Tensor | Pairwise ranking loss: -log(sigmoid(chosen - rejected)) |
| chosen_end_scores | torch.Tensor | [bs] scalar rewards for chosen completions |
| rejected_end_scores | torch.Tensor | [bs] scalar rewards for rejected completions |
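The end scores above are read off the per-token reward sequence at the last real (non-padding) position, as identified via `PAD_ID` from the model's init. A simplified pure-Python sketch of that extraction (the real code operates on tensors and also handles divergence points between chosen and rejected):

```python
def end_of_sequence_index(input_ids, pad_id):
    """Index of the token whose per-token reward becomes the scalar score:
    the position just before the first pad token, or the final position
    if the sequence is unpadded."""
    if pad_id in input_ids:
        return input_ids.index(pad_id) - 1
    return len(input_ids) - 1

# Per-token rewards of shape [seq_len] are reduced to one scalar per sequence.
tokens = [11, 42, 7, 50256, 50256]   # 50256 = GPT-J eos/pad id
rewards = [0.1, -0.3, 0.8, 0.0, 0.0]
score = rewards[end_of_sequence_index(tokens, 50256)]
```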
Usage Examples
Training the Reward Model
from reward_model import GPTRewardModel
from transformers import Trainer, TrainingArguments

# Initialize from SFT checkpoint
model = GPTRewardModel("CarperAI/openai_summarize_tldr_sft")

# Freeze first 70% of layers
layers = model.transformer.h
num_frozen = int(0.7 * len(layers))
for layer in layers[:num_frozen]:
    layer.requires_grad_(False)

# Train with HuggingFace Trainer
# (pairwise_dataset / eval_dataset: tokenized comparison splits, defined elsewhere)
training_args = TrainingArguments(
    output_dir="rm_checkpoint",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    deepspeed="ds_config_gpt_j.json",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=pairwise_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
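The forward pass expects each batch laid out as all chosen sequences followed by all rejected ones ([2*bs, seq_len]), so the training pipeline needs a collator that builds that layout. A minimal pure-Python sketch; the field names `chosen_ids`/`rejected_ids` are assumptions about how the comparison dataset was tokenized, not trlx's actual names:

```python
def collate_pairwise(batch, pad_id, max_len):
    """Stack chosen sequences first, then rejected, so the model can
    split the batch as chosen = ids[:bs], rejected = ids[bs:]."""
    def pad(seq):
        return seq[:max_len] + [pad_id] * (max_len - len(seq))
    chosen = [pad(ex["chosen_ids"]) for ex in batch]      # assumed field name
    rejected = [pad(ex["rejected_ids"]) for ex in batch]  # assumed field name
    return {"input_ids": chosen + rejected}

batch = [{"chosen_ids": [1, 2, 3], "rejected_ids": [1, 2, 4, 5]}]
out = collate_pairwise(batch, pad_id=50256, max_len=6)
```

In a real setup the lists would be stacked into a `torch.LongTensor` and passed as `data_collator` to the `Trainer`.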
Inference (Scoring Completions)
import torch

from reward_model import GPTRewardModel

model = GPTRewardModel("EleutherAI/gpt-j-6B")
model.load_state_dict(torch.load("rm_checkpoint/pytorch_model.bin"), strict=False)
model.eval()

# Score a completion
text = "Prompt: ... TL;DR: This is a summary."
input_ids = model.tokenizer(text, return_tensors="pt").input_ids

# Duplicate for inference mode (chosen == rejected trick)
input_ids = input_ids.repeat(2, 1)
with torch.no_grad():
    output = model(input_ids=input_ids)
score = output["chosen_end_scores"][0].item()
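One common downstream use of such scalar scores (e.g. as a sanity check before PPO) is best-of-n selection: score several candidate completions and keep the highest. A trivial sketch; `score_fn` stands in for a wrapper around the model call above and the example scores are made up:

```python
def best_of_n(completions, score_fn):
    """Pick the completion with the highest reward score."""
    return max(completions, key=score_fn)

# Hypothetical scores a trained reward model might assign.
scores = {"short summary": 0.2, "faithful summary": 1.3}
best = best_of_n(list(scores), scores.get)
```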