Implementation:Princeton nlp SimPO Reward Model Annotate Script
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reward_Modeling, Data_Generation |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Concrete tool for scoring candidate responses with a reward model and binarizing them into chosen/rejected preference pairs, implemented as a standalone Python script using Transformers and NumPy.
Description
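The reward_model_annotate.py script performs three operations: (1) it loads a reward model (AutoModelForSequenceClassification) and its tokenizer; (2) it scores each candidate response by applying the chat template to the prompt/response pair, tokenizing, and running a forward pass to read the reward score (output.score); and (3) it binarizes the candidates by taking the highest-scored (argmax) response as chosen and the lowest-scored (argmin) response as rejected, each stored in OpenAI message format. The results are saved both as annotated JSON and as a HuggingFace Dataset on disk. Note that output.score is an attribute provided by the custom model code of the default ArmoRM reward model (hence trust_remote_code=True); other sequence-classification reward models may expose their score differently, for example through logits. This implementation combines a Wrapper Doc (for the external reward model API) with a Pattern Doc (for the binarization logic).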
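The binarization step (3) can be sketched without the model at all. The following is a minimal illustration, not the script's own code: the function name `binarize` is hypothetical, and plain Python `max`/`min` stand in for `np.argmax`/`np.argmin` (both pick the first index on ties) so the example runs with the standard library only.

```python
from typing import Dict, List

Message = Dict[str, str]


def binarize(prompt: str, responses: List[str], scores: List[float]) -> Dict[str, List[Message]]:
    """Pick the highest- and lowest-scored responses as the chosen/rejected pair."""
    if not responses or len(responses) != len(scores):
        raise ValueError("need exactly one score per response")
    indices = range(len(scores))
    # max()/min() return the first extreme index, matching np.argmax/np.argmin on ties.
    chosen_idx = max(indices, key=lambda i: scores[i])
    rejected_idx = min(indices, key=lambda i: scores[i])
    user_turn = {"role": "user", "content": prompt}
    return {
        "chosen": [dict(user_turn), {"role": "assistant", "content": responses[chosen_idx]}],
        "rejected": [dict(user_turn), {"role": "assistant", "content": responses[rejected_idx]}],
    }


pair = binarize(
    "What is machine learning?",
    ["Response A...", "Response B...", "Response C..."],
    [0.85, 0.72, 0.91],
)
print(pair["chosen"][1]["content"])    # Response C...
print(pair["rejected"][1]["content"])  # Response B...
```

Both conversations share the same user turn and differ only in the assistant turn, which is the shape preference-optimization trainers typically expect for "chosen" and "rejected" columns.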
Usage
Run the script after post_process.py has produced all_outputs.json. A CUDA-capable GPU is required for reward model inference, since both the model and its inputs are placed on "cuda".
Code Reference
Source Location
- Repository: SimPO
- File: on_policy_data_gen/reward_model_annotate.py (Lines 1-86)
Signature
# External APIs used:
model = AutoModelForSequenceClassification.from_pretrained(
    reward_model: str,
    device_map: str = "cuda",
    trust_remote_code: bool = True,
    torch_dtype = torch.bfloat16,
) -> PreTrainedModel
tokenizer = AutoTokenizer.from_pretrained(
    reward_model: str,
    use_fast: bool = True,
) -> PreTrainedTokenizer
# Reward scoring pattern:
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    output = model(input_ids.to("cuda"))
    score = output.score.float().item()
# Binarization pattern:
chosen_idx = np.argmax(scores)
rejected_idx = np.argmin(scores)
# Dataset conversion:
dataset = datasets.Dataset.from_list(data)
dataset.save_to_disk(output_dir)
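The scoring pattern above runs once per candidate response. A rough sketch of that loop is shown below, with the model call hidden behind a hypothetical `score_fn` callable so the control flow can be followed and tested without a GPU. The names `annotate` and `toy_score` are illustrative assumptions, and `toy_score` is a stand-in rather than a reward model; in practice `score_fn` would wrap the chat-template and forward-pass calls listed above.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def annotate(entry: dict, score_fn: Callable[[List[Message]], float]) -> dict:
    """Return a copy of `entry` with one reward score per candidate response."""
    scores = []
    for response in entry["all_generated_responses"]:
        # Each candidate is scored as a complete two-turn conversation.
        messages = [
            {"role": "user", "content": entry["prompt"]},
            {"role": "assistant", "content": response},
        ]
        scores.append(float(score_fn(messages)))
    return {**entry, "all_rm_scores": scores}


def toy_score(messages: List[Message]) -> float:
    """Stand-in scorer (NOT a reward model): longer answers score higher."""
    return float(len(messages[-1]["content"]))


entry = {"prompt": "Hi", "all_generated_responses": ["a", "abc", "ab"]}
annotated = annotate(entry, toy_score)
print(annotated["all_rm_scores"])  # [1.0, 3.0, 2.0]
```

Keeping the scorer injectable is a design choice of this sketch only; it makes it easy to swap reward models or to unit-test the surrounding logic with a fake scorer.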
Import
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np
import datasets
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --generation_file | str | No | Path to all_outputs.json from post-processing (default: "datasets/gemma2_ultrafeedback/all_outputs.json") |
| --reward_model | str | No | HuggingFace reward model ID (default: "RLHFlow/ArmoRM-Llama3-8B-v0.1") |
| --output_dir | str | No | Output directory (default: "datasets/gemma2_ultrafeedback/") |
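For reference, a command-line interface with these three options and defaults could be declared as follows. This is a sketch assuming the script uses `argparse`; the helper name `build_parser` and the description string are illustrative, while the flag names and default values are taken from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three optional flags with the defaults from the Inputs table."""
    parser = argparse.ArgumentParser(description="Annotate generations with reward scores")
    parser.add_argument("--generation_file", type=str,
                        default="datasets/gemma2_ultrafeedback/all_outputs.json")
    parser.add_argument("--reward_model", type=str,
                        default="RLHFlow/ArmoRM-Llama3-8B-v0.1")
    parser.add_argument("--output_dir", type=str,
                        default="datasets/gemma2_ultrafeedback/")
    return parser


# An empty argument list yields all defaults; any flag can be overridden individually.
defaults = build_parser().parse_args([])
overridden = build_parser().parse_args(["--output_dir", "out/"])
print(defaults.reward_model)
print(overridden.output_dir)
```

Because every flag has a default, the script can be run with no arguments as long as the default generation file exists.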
Outputs
| Name | Type | Description |
|---|---|---|
| all_outputs_rm.json | JSON file | Annotated data with "all_rm_scores" added per prompt |
| HuggingFace Dataset | Directory | Dataset with "chosen" and "rejected" columns in OpenAI message format |
Usage Examples
Running Reward Annotation
python on_policy_data_gen/reward_model_annotate.py \
    --generation_file datasets/gemma2_ultrafeedback/all_outputs.json \
    --reward_model RLHFlow/ArmoRM-Llama3-8B-v0.1 \
    --output_dir datasets/gemma2_ultrafeedback/
Understanding the Output Format
# After annotation, each entry has reward scores:
{
    "prompt": "What is machine learning?",
    "all_generated_responses": ["Response A...", "Response B...", "Response C..."],
    "all_rm_scores": [0.85, 0.72, 0.91],  # ArmoRM scores
    "chosen": [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Response C..."}  # argmax score (0.91)
    ],
    "rejected": [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Response B..."}  # argmin score (0.72)
    ]
}
# The HuggingFace Dataset can be loaded for SimPO training:
from datasets import load_from_disk
dataset = load_from_disk("datasets/gemma2_ultrafeedback/")
print(dataset.column_names) # ['prompt', 'all_generated_responses', 'all_rm_scores', 'chosen', 'rejected']
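One edge case to keep in mind: `np.argmax` and `np.argmin` both return the first index when scores are tied, so a prompt whose candidates all receive the same score yields identical chosen and rejected responses. Whether the script filters such pairs is not stated here; the sketch below is an optional post-hoc check, with the hypothetical helpers `score_margin` and `drop_degenerate` operating on plain dictionaries shaped like the entries above.

```python
from typing import Dict, List


def score_margin(entry: Dict) -> float:
    """Gap between the best and worst reward score for one prompt."""
    scores = entry["all_rm_scores"]
    return max(scores) - min(scores)


def drop_degenerate(entries: List[Dict], min_margin: float = 0.0) -> List[Dict]:
    """Keep only entries whose best and worst scores differ by more than min_margin."""
    return [e for e in entries if score_margin(e) > min_margin]


entries = [
    {"prompt": "p1", "all_rm_scores": [0.85, 0.72, 0.91]},
    {"prompt": "p2", "all_rm_scores": [0.5, 0.5, 0.5]},  # tie: chosen == rejected
]
kept = drop_degenerate(entries)
print([e["prompt"] for e in kept])  # ['p1']
```

Raising `min_margin` above zero would additionally discard pairs whose preference signal is very weak; the right threshold, if any, depends on the reward model's score scale.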