Workflow:Princeton nlp SimPO On Policy Data Generation

Knowledge Sources	SimPO SimPO: Simple Preference Optimization with a Reference-Free Reward
Domains	LLMs, Data_Engineering, Preference_Optimization
Last Updated	2026-02-08 04:00 GMT

Overview

End-to-end pipeline for generating on-policy preference datasets by sampling multiple responses from a language model, then scoring and binarizing them with a reward model.

Description

This workflow creates preference training data where the model itself generates candidate responses (on-policy), rather than relying on external human annotations. The process involves three stages: multi-seed response generation using vLLM for efficient batched inference, post-processing to combine responses and filter out degenerate samples, and reward model annotation to score each response and select chosen/rejected pairs. The resulting binarized dataset can be used directly with SimPO or other preference optimization algorithms. This approach was used to produce the v0.2 models (Llama-3-Instruct-8B-SimPO-v0.2) and the Gemma-2-9B-IT-SimPO model.

Key features:

On-policy generation: responses come from the target model, improving distribution alignment
Multi-seed sampling: generates diverse responses by running decoding under different random seeds
ArmoRM reward scoring: uses a strong reward model for automatic preference annotation
Binarization: converts multi-response scores into simple chosen/rejected pairs

Usage

Execute this workflow when you want to create custom preference data from an existing language model rather than using pre-existing preference datasets. This is particularly useful when: (1) improving upon an already instruction-tuned model, (2) the available preference datasets are annotated by a weaker labeler, or (3) you want the training distribution to match the model's own generation distribution (on-policy). Requires GPU resources for both vLLM inference and reward model scoring.

Execution Steps

Step 1: Multi Seed Response Generation

Generate one response per prompt for each of multiple random seeds using vLLM, a high-throughput inference engine. The model is loaded with vLLM and the source dataset prompts are formatted using the model's chat template with a generation prompt appended. Sampling parameters (temperature, top_p, max_tokens) control the diversity of generated responses. Each seed produces a separate JSON output file.

What happens:

Load the source dataset (default: HuggingFaceH4/ultrafeedback_binarized, train_prefs split)
Extract and deduplicate prompts from the dataset
Format each prompt using the tokenizer's chat template with generation prompt enabled
Run vLLM batched generation with configurable temperature (default 0.8), top_p (0.95), and max_tokens (4096)
Save per-seed output as JSON containing original prompt, formatted prompt, and generated text
Repeat under 5 different seeds (default: 13, 21, 42, 79, 100)

Note: For Gemma-2 models, the FLASHINFER attention backend must be enabled via environment variable.

Step 2: Post Processing

Combine the per-seed generation files into a single dataset and filter out degenerate samples where all seeds produced identical responses (which would provide no preference signal). The result is a clean dataset where each prompt has multiple diverse candidate responses.

What happens:

Read all output_*.json files from the generation directory
Group responses by prompt across all seeds
Filter out prompts where all generated responses are identical (no useful preference signal)
Produce a combined JSON file (all_outputs.json) with each entry containing the prompt and a list of all generated responses

Step 3: Reward Model Annotation and Binarization

Score each candidate response using a reward model, then convert the multi-response scores into a binary chosen/rejected format suitable for preference optimization training. The reward model evaluates each (prompt, response) pair independently and assigns a scalar quality score.

What happens:

Load the reward model (default: RLHFlow/ArmoRM-Llama3-8B-v0.1) for sequence classification
For each prompt, score all candidate responses by formatting as chat messages and running reward model inference
Save intermediate scored results (all_outputs_rm.json)
Binarize: select highest-scoring response as "chosen" and lowest-scoring as "rejected"
Format chosen/rejected pairs in OpenAI message format (role/content dicts)
Save as both JSON and HuggingFace Dataset format for direct use in SimPO training

Execution Diagram

GitHub URL

Workflow Repository