Principle:Princeton nlp SimPO Reward Model Annotation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Reward_Modeling, Data_Generation |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
A scoring and binarization process that uses a reward model to rank candidate responses and construct chosen/rejected preference pairs.
Description
Reward model annotation is the final step in on-policy data generation. A pre-trained reward model (ArmoRM) scores each candidate response for a given prompt. The highest-scoring response becomes the chosen example and the lowest-scoring becomes the rejected example, forming a binary preference pair suitable for SimPO training. This approach generates on-policy data — preference pairs derived from the model's own outputs — which can improve training quality compared to using only off-policy datasets. The reward model acts as a proxy for human preferences, enabling automated preference annotation at scale.
Usage
Use this principle after post-processing multi-seed responses. The output is a HuggingFace Dataset with chosen and rejected columns in OpenAI message format, ready to be used as training data for SimPO.
Theoretical Basis
The binarization process follows best-of-N selection:
- Score all candidates — For each prompt, compute reward scores for all candidate responses
- Select extremes — Choose the highest-scoring response (argmax) as chosen and the lowest-scoring (argmin) as rejected
- Format as preferences — Construct OpenAI message format pairs for each prompt
Where R(x, y) is the reward model's score for response y given prompt x, and Y is the set of all candidate responses.
This approach creates maximally informative preference pairs by selecting the most extreme quality contrast available.