Implementation:Princeton nlp SimPO Reward Model Annotate Script

From Leeroopedia


Knowledge Sources
Domains: NLP, Reward_Modeling, Data_Generation
Last Updated: 2026-02-08 04:30 GMT

Overview

A concrete tool that scores candidate responses with a reward model and binarizes them into chosen/rejected preference pairs, implemented as a standalone Python script using Transformers and NumPy.

Description

The reward_model_annotate.py script performs three operations: (1) it loads a reward model (AutoModelForSequenceClassification) and its tokenizer; (2) it scores each candidate response by applying the chat template, tokenizing, and running a forward pass to read the reward score (output.score); and (3) it binarizes the candidates by taking the highest-scoring (argmax) response as chosen and the lowest-scoring (argmin) response as rejected, each stored in OpenAI message format. The result is saved both as annotated JSON and as a HuggingFace Dataset on disk. This implementation combines a Wrapper Doc (for the external reward model API) with a Pattern Doc (for the binarization logic).
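The score-then-binarize flow described above can be sketched without a GPU by injecting the scorer as a function. This is a minimal illustration, not the script's code: `build_messages`, `annotate_and_binarize`, and `toy_score` are hypothetical names, the toy scorer stands in for the real reward model forward pass, and plain Python `max`/`min` are used in place of `np.argmax`/`np.argmin` so the sketch has no dependencies.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_messages(prompt: str, response: str) -> List[Message]:
    """One user turn plus one assistant turn, in OpenAI message format."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]


def annotate_and_binarize(
    entries: List[dict],
    score_fn: Callable[[List[Message]], float],
) -> List[dict]:
    """Add all_rm_scores, chosen, and rejected to each entry (in place)."""
    for entry in entries:
        prompt = entry["prompt"]
        responses = entry["all_generated_responses"]
        scores = [score_fn(build_messages(prompt, r)) for r in responses]
        entry["all_rm_scores"] = scores
        # Plain-Python stand-ins for np.argmax / np.argmin (first index wins on ties).
        chosen_idx = max(range(len(scores)), key=scores.__getitem__)
        rejected_idx = min(range(len(scores)), key=scores.__getitem__)
        entry["chosen"] = build_messages(prompt, responses[chosen_idx])
        entry["rejected"] = build_messages(prompt, responses[rejected_idx])
    return entries


def toy_score(messages: List[Message]) -> float:
    """Hypothetical stand-in scorer: longer assistant replies score higher."""
    return float(len(messages[-1]["content"]))


demo = annotate_and_binarize(
    [{"prompt": "Hi?", "all_generated_responses": ["ok", "a longer reply", "mid one"]}],
    toy_score,
)
```

In the real script the role of `score_fn` is played by the reward model call; passing it in as a parameter is only a convenience for trying the binarization logic on small inputs.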

Usage

Run this script after post_process.py has produced all_outputs.json. Reward model inference requires a CUDA GPU, because the model is loaded with device_map="cuda".

Code Reference

Source Location

  • Repository: SimPO
  • File: on_policy_data_gen/reward_model_annotate.py (Lines 1-86)

Signature

# External APIs used (reward_model is the --reward_model string):
model = AutoModelForSequenceClassification.from_pretrained(
    reward_model,                 # str: HuggingFace reward model ID
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)  # -> PreTrainedModel

tokenizer = AutoTokenizer.from_pretrained(
    reward_model,                 # same ID as the model
    use_fast=True,
)  # -> PreTrainedTokenizer

# Reward scoring pattern:
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    output = model(input_ids.to("cuda"))
    score = output.score.float().item()

# Binarization pattern:
chosen_idx = np.argmax(scores)
rejected_idx = np.argmin(scores)

# Dataset conversion:
dataset = datasets.Dataset.from_list(data)
dataset.save_to_disk(output_dir)
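One consequence of the argmax/argmin pattern is worth illustrating: when every candidate receives the same score (or there is only one candidate), both calls return index 0, so chosen and rejected are the same response. The sketch below uses plain Python equivalents of `np.argmax`/`np.argmin`, which likewise return the first index on ties. `pick_pair` and `is_degenerate` are hypothetical helpers, and whether the script filters such pairs is not stated here.

```python
from typing import List, Tuple


def pick_pair(scores: List[float]) -> Tuple[int, int]:
    """Return (chosen_idx, rejected_idx); the first index wins on ties."""
    indices = range(len(scores))
    chosen_idx = max(indices, key=scores.__getitem__)
    rejected_idx = min(indices, key=scores.__getitem__)
    return chosen_idx, rejected_idx


def is_degenerate(scores: List[float]) -> bool:
    """True when chosen and rejected would point at the same response."""
    chosen_idx, rejected_idx = pick_pair(scores)
    return chosen_idx == rejected_idx


normal_pair = pick_pair([0.85, 0.72, 0.91])
tied_pair = pick_pair([0.5, 0.5, 0.5])
```

A check like `is_degenerate` could be applied before training if identical chosen/rejected pairs are unwanted.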

Import

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import numpy as np
import datasets

I/O Contract

Inputs

Name | Type | Required | Description
--generation_file | str | No | Path to all_outputs.json from post-processing (default: "datasets/gemma2_ultrafeedback/all_outputs.json")
--reward_model | str | No | HuggingFace reward model ID (default: "RLHFlow/ArmoRM-Llama3-8B-v0.1")
--output_dir | str | No | Output directory (default: "datasets/gemma2_ultrafeedback/")
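The three optional flags above map naturally onto an `argparse` parser. The following is a sketch under that assumption, reusing only the flag names and defaults from the table; `build_parser`, the description string, and the `my-org/my-reward-model` ID are hypothetical and not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags and their defaults."""
    parser = argparse.ArgumentParser(
        description="Score candidate responses with a reward model"
    )
    parser.add_argument(
        "--generation_file",
        type=str,
        default="datasets/gemma2_ultrafeedback/all_outputs.json",
    )
    parser.add_argument(
        "--reward_model",
        type=str,
        default="RLHFlow/ArmoRM-Llama3-8B-v0.1",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="datasets/gemma2_ultrafeedback/",
    )
    return parser


# With no arguments, every flag falls back to its default.
defaults = build_parser().parse_args([])

# Overriding one flag leaves the others at their defaults
# (the model ID here is a made-up placeholder).
custom = build_parser().parse_args(["--reward_model", "my-org/my-reward-model"])
```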

Outputs

Name | Type | Description
all_outputs_rm.json | JSON file | Annotated data with "all_rm_scores" added per prompt
HuggingFace Dataset | Directory | Dataset with "chosen" and "rejected" columns in OpenAI message format

Usage Examples

Running Reward Annotation

python on_policy_data_gen/reward_model_annotate.py \
    --generation_file datasets/gemma2_ultrafeedback/all_outputs.json \
    --reward_model RLHFlow/ArmoRM-Llama3-8B-v0.1 \
    --output_dir datasets/gemma2_ultrafeedback/

Understanding the Output Format

# After annotation, each entry has reward scores:
{
    "prompt": "What is machine learning?",
    "all_generated_responses": ["Response A...", "Response B...", "Response C..."],
    "all_rm_scores": [0.85, 0.72, 0.91],  # ArmoRM scores
    "chosen": [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Response C..."}  # argmax score (0.91)
    ],
    "rejected": [
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Response B..."}  # argmin score (0.72)
    ]
}

# The HuggingFace Dataset can be loaded for SimPO training:
from datasets import load_from_disk
dataset = load_from_disk("datasets/gemma2_ultrafeedback/")
print(dataset.column_names)  # ['prompt', 'all_generated_responses', 'all_rm_scores', 'chosen', 'rejected']
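Because each annotated entry keeps its full "all_rm_scores" list, the gap between the chosen and rejected scores can be inspected after the fact. The sketch below computes that margin and drops entries below a threshold. This is a hypothetical post-processing step, not something the script is described as doing; `reward_margin`, `filter_by_margin`, the second sample entry, and the 0.1 threshold are all illustrative assumptions.

```python
from typing import List


def reward_margin(entry: dict) -> float:
    """Score gap between the argmax (chosen) and argmin (rejected) responses."""
    scores = entry["all_rm_scores"]
    return max(scores) - min(scores)


def filter_by_margin(entries: List[dict], min_margin: float) -> List[dict]:
    """Keep only entries whose chosen/rejected score gap is at least min_margin."""
    return [e for e in entries if reward_margin(e) >= min_margin]


entries = [
    {"prompt": "What is machine learning?", "all_rm_scores": [0.85, 0.72, 0.91]},
    # Hypothetical entry where every candidate received the same score.
    {"prompt": "Hypothetical tied prompt", "all_rm_scores": [0.40, 0.40, 0.40]},
]
kept = filter_by_margin(entries, min_margin=0.1)
```

The same functions work on rows of the loaded HuggingFace Dataset, since each row is a dict with the same keys.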

Related Pages

Implements Principle

Requires Environment
