Implementation: CarperAI trlx GPTRewardModel
| Knowledge Sources | Details |
|---|---|
| Domains | Reward_Modeling, NLP, Model_Architecture |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
Concrete tool from the trlx summarization example for training and running inference with a reward model on pairwise comparison data.
Description
GPTRewardModel is a PyTorch module that wraps a pre-trained causal language model (e.g., GPT-J-6B) with a linear value head for scalar reward prediction. It extracts the transformer backbone, discards the LM head, and adds a projection from hidden size to 1. The forward pass computes per-token rewards, then extracts end-of-sequence rewards for chosen and rejected completions. Training uses the Bradley-Terry pairwise ranking loss.
The model handles both training and inference modes. During training, input is a batch of concatenated (chosen, rejected) pairs. During inference, identical pairs are passed and only the chosen score is returned.
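The Bradley-Terry pairwise loss mentioned above reduces to -log(sigmoid(chosen_reward - rejected_reward)) per pair. A minimal torch-free sketch of the per-pair objective (illustration only, not the trlx implementation):

```python
import math

def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(chosen - rejected)).
    The loss shrinks as the chosen score exceeds the rejected score."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs a small loss; a reversed pair, a large one.
loss_good = bradley_terry_loss(2.0, -1.0)  # chosen scored higher
loss_bad = bradley_terry_loss(-1.0, 2.0)   # chosen scored lower
```

At equal scores the loss is log 2, i.e. the model is at chance; training pushes the margin positive.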
Usage
Use GPTRewardModel when training a reward model on pairwise comparison data (e.g., CarperAI/openai_summarize_comparisons). The trained model is then loaded into a reward function for PPO training.
Code Reference
Source Location
- Repository: trlx
- File: examples/summarize_rlhf/reward_model/reward_model.py
- Lines: L6-104
Signature
class GPTRewardModel(nn.Module):
    def __init__(self, model_path: str):
        """
        Initialize reward model from pre-trained causal LM.

        Args:
            model_path: HuggingFace model name or local path
                (e.g., "EleutherAI/gpt-j-6B" or SFT checkpoint path).
        """
        super().__init__()
        model = AutoModelForCausalLM.from_pretrained(model_path)
        self.config = model.config
        self.transformer = model.transformer
        self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)
        self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.PAD_ID = self.tokenizer(self.tokenizer.pad_token)["input_ids"][0]

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        mc_token_ids=None,
        labels=None,
        return_dict=False,
        output_attentions=False,
        output_hidden_states=False,
    ) -> Dict:
        """
        Forward pass for pairwise ranking.

        Input batch has shape [2*bs, seq_len] with chosen[:bs] and rejected[bs:].

        Returns:
            dict with "loss", "chosen_end_scores", "rejected_end_scores"
            (or just "chosen_end_scores" during inference).
        """
Import
from reward_model import GPTRewardModel
# or
from examples.summarize_rlhf.reward_model.reward_model import GPTRewardModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes (init) | Pre-trained model name/path (e.g., SFT checkpoint) |
| input_ids | torch.LongTensor | Yes (forward) | [2*bs, seq_len] concatenated chosen+rejected |
| attention_mask | torch.LongTensor | No | Padding attention mask |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | torch.Tensor | Pairwise ranking loss: -log(sigmoid(chosen - rejected)) |
| chosen_end_scores | torch.Tensor | [bs] scalar rewards for chosen completions |
| rejected_end_scores | torch.Tensor | [bs] scalar rewards for rejected completions |
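The end scores above are read off the per-token reward sequence at the last real (non-padding) position, as identified via `PAD_ID` from the model's init. A simplified pure-Python sketch of that extraction (the real code operates on tensors and also handles divergence points between chosen and rejected):

```python
def end_of_sequence_index(input_ids, pad_id):
    """Index of the token whose per-token reward becomes the scalar score:
    the position just before the first pad token, or the final position
    if the sequence is unpadded."""
    if pad_id in input_ids:
        return input_ids.index(pad_id) - 1
    return len(input_ids) - 1

# Per-token rewards of shape [seq_len] are reduced to one scalar per sequence.
tokens = [11, 42, 7, 50256, 50256]   # 50256 = GPT-J eos/pad id
rewards = [0.1, -0.3, 0.8, 0.0, 0.0]
score = rewards[end_of_sequence_index(tokens, 50256)]
```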
Usage Examples
Training the Reward Model
from reward_model import GPTRewardModel
from transformers import Trainer, TrainingArguments

# Initialize from SFT checkpoint
model = GPTRewardModel("CarperAI/openai_summarize_tldr_sft")

# Freeze first 70% of layers
layers = model.transformer.h
num_frozen = int(0.7 * len(layers))
for layer in layers[:num_frozen]:
    layer.requires_grad_(False)

# Train with HuggingFace Trainer
# (pairwise_dataset / eval_dataset: tokenized comparison splits, defined elsewhere)
training_args = TrainingArguments(
    output_dir="rm_checkpoint",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    deepspeed="ds_config_gpt_j.json",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=pairwise_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
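The forward pass expects each batch laid out as all chosen sequences followed by all rejected ones ([2*bs, seq_len]), so the training pipeline needs a collator that builds that layout. A minimal pure-Python sketch; the field names `chosen_ids`/`rejected_ids` are assumptions about how the comparison dataset was tokenized, not trlx's actual names:

```python
def collate_pairwise(batch, pad_id, max_len):
    """Stack chosen sequences first, then rejected, so the model can
    split the batch as chosen = ids[:bs], rejected = ids[bs:]."""
    def pad(seq):
        return seq[:max_len] + [pad_id] * (max_len - len(seq))
    chosen = [pad(ex["chosen_ids"]) for ex in batch]      # assumed field name
    rejected = [pad(ex["rejected_ids"]) for ex in batch]  # assumed field name
    return {"input_ids": chosen + rejected}

batch = [{"chosen_ids": [1, 2, 3], "rejected_ids": [1, 2, 4, 5]}]
out = collate_pairwise(batch, pad_id=50256, max_len=6)
```

In a real setup the lists would be stacked into a `torch.LongTensor` and passed as `data_collator` to the `Trainer`.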
Inference (Scoring Completions)
import torch

from reward_model import GPTRewardModel

model = GPTRewardModel("EleutherAI/gpt-j-6B")
model.load_state_dict(torch.load("rm_checkpoint/pytorch_model.bin"), strict=False)
model.eval()

# Score a completion
text = "Prompt: ... TL;DR: This is a summary."
input_ids = model.tokenizer(text, return_tensors="pt").input_ids

# Duplicate for inference mode (chosen == rejected trick)
input_ids = input_ids.repeat(2, 1)
with torch.no_grad():
    output = model(input_ids=input_ids)
score = output["chosen_end_scores"][0].item()
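One common downstream use of such scalar scores (e.g. as a sanity check before PPO) is best-of-n selection: score several candidate completions and keep the highest. A trivial sketch; `score_fn` stands in for a wrapper around the model call above and the example scores are made up:

```python
def best_of_n(completions, score_fn):
    """Pick the completion with the highest reward score."""
    return max(completions, key=score_fn)

# Hypothetical scores a trained reward model might assign.
scores = {"short summary": 0.2, "faithful summary": 1.3}
best = best_of_n(list(scores), scores.get)
```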