Workflow:Princeton nlp SimPO On Policy Data Generation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Data_Engineering, Preference_Optimization |
| Last Updated | 2026-02-08 04:00 GMT |
Overview
End-to-end pipeline for generating on-policy preference datasets by sampling multiple responses from a language model, then scoring and binarizing them with a reward model.
Description
This workflow creates preference training data where the model itself generates candidate responses (on-policy), rather than relying on external human annotations. The process involves three stages: multi-seed response generation using vLLM for efficient batched inference, post-processing to combine responses and filter out degenerate samples, and reward model annotation to score each response and select chosen/rejected pairs. The resulting binarized dataset can be used directly with SimPO or other preference optimization algorithms. This approach was used to produce the v0.2 models (Llama-3-Instruct-8B-SimPO-v0.2) and the Gemma-2-9B-IT-SimPO model.
Key features:
- On-policy generation: responses come from the target model, improving distribution alignment
- Multi-seed sampling: generates diverse responses by running decoding under different random seeds
- ArmoRM reward scoring: uses a strong reward model for automatic preference annotation
- Binarization: converts multi-response scores into simple chosen/rejected pairs
Usage
Execute this workflow when you want to create custom preference data from an existing language model rather than using pre-existing preference datasets. This is particularly useful when: (1) improving upon an already instruction-tuned model, (2) the available preference datasets are annotated by a weaker labeler, or (3) you want the training distribution to match the model's own generation distribution (on-policy). Requires GPU resources for both vLLM inference and reward model scoring.
Execution Steps
Step 1: Multi Seed Response Generation
Generate one response per prompt for each of multiple random seeds using vLLM, a high-throughput inference engine. The model is loaded with vLLM and the source dataset prompts are formatted using the model's chat template with a generation prompt appended. Sampling parameters (temperature, top_p, max_tokens) control the diversity of generated responses. Each seed produces a separate JSON output file.
What happens:
- Load the source dataset (default: HuggingFaceH4/ultrafeedback_binarized, train_prefs split)
- Extract and deduplicate prompts from the dataset
- Format each prompt using the tokenizer's chat template with generation prompt enabled
- Run vLLM batched generation with configurable temperature (default 0.8), top_p (0.95), and max_tokens (4096)
- Save per-seed output as JSON containing original prompt, formatted prompt, and generated text
- Repeat under 5 different seeds (default: 13, 21, 42, 79, 100)
Note: For Gemma-2 models, the FLASHINFER attention backend must be enabled via environment variable.
Step 2: Post Processing
Combine the per-seed generation files into a single dataset and filter out degenerate samples where all seeds produced identical responses (which would provide no preference signal). The result is a clean dataset where each prompt has multiple diverse candidate responses.
What happens:
- Read all output_*.json files from the generation directory
- Group responses by prompt across all seeds
- Filter out prompts where all generated responses are identical (no useful preference signal)
- Produce a combined JSON file (all_outputs.json) with each entry containing the prompt and a list of all generated responses
Step 3: Reward Model Annotation and Binarization
Score each candidate response using a reward model, then convert the multi-response scores into a binary chosen/rejected format suitable for preference optimization training. The reward model evaluates each (prompt, response) pair independently and assigns a scalar quality score.
What happens:
- Load the reward model (default: RLHFlow/ArmoRM-Llama3-8B-v0.1) for sequence classification
- For each prompt, score all candidate responses by formatting as chat messages and running reward model inference
- Save intermediate scored results (all_outputs_rm.json)
- Binarize: select highest-scoring response as "chosen" and lowest-scoring as "rejected"
- Format chosen/rejected pairs in OpenAI message format (role/content dicts)
- Save as both JSON and HuggingFace Dataset format for direct use in SimPO training