Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Princeton nlp SimPO On Policy Data Generation

From Leeroopedia


Knowledge Sources
Domains LLMs, Data_Engineering, Preference_Optimization
Last Updated 2026-02-08 04:00 GMT

Overview

End-to-end pipeline for generating on-policy preference datasets by sampling multiple responses from a language model, then scoring and binarizing them with a reward model.

Description

This workflow creates preference training data where the model itself generates candidate responses (on-policy), rather than relying on external human annotations. The process involves three stages: multi-seed response generation using vLLM for efficient batched inference, post-processing to combine responses and filter out degenerate samples, and reward model annotation to score each response and select chosen/rejected pairs. The resulting binarized dataset can be used directly with SimPO or other preference optimization algorithms. This approach was used to produce the v0.2 models (Llama-3-Instruct-8B-SimPO-v0.2) and the Gemma-2-9B-IT-SimPO model.

Key features:

  • On-policy generation: responses come from the target model, improving distribution alignment
  • Multi-seed sampling: generates diverse responses by running decoding under different random seeds
  • ArmoRM reward scoring: uses a strong reward model for automatic preference annotation
  • Binarization: converts multi-response scores into simple chosen/rejected pairs

Usage

Execute this workflow when you want to create custom preference data from an existing language model rather than using pre-existing preference datasets. This is particularly useful when: (1) improving upon an already instruction-tuned model, (2) the available preference datasets are annotated by a weaker labeler, or (3) you want the training distribution to match the model's own generation distribution (on-policy). Requires GPU resources for both vLLM inference and reward model scoring.

Execution Steps

Step 1: Multi Seed Response Generation

Generate one response per prompt for each of multiple random seeds using vLLM, a high-throughput inference engine. The model is loaded with vLLM and the source dataset prompts are formatted using the model's chat template with a generation prompt appended. Sampling parameters (temperature, top_p, max_tokens) control the diversity of generated responses. Each seed produces a separate JSON output file.

What happens:

  • Load the source dataset (default: HuggingFaceH4/ultrafeedback_binarized, train_prefs split)
  • Extract and deduplicate prompts from the dataset
  • Format each prompt using the tokenizer's chat template with generation prompt enabled
  • Run vLLM batched generation with configurable temperature (default 0.8), top_p (0.95), and max_tokens (4096)
  • Save per-seed output as JSON containing original prompt, formatted prompt, and generated text
  • Repeat under 5 different seeds (default: 13, 21, 42, 79, 100)

Note: For Gemma-2 models, the FLASHINFER attention backend must be enabled via environment variable.

Step 2: Post Processing

Combine the per-seed generation files into a single dataset and filter out degenerate samples where all seeds produced identical responses (which would provide no preference signal). The result is a clean dataset where each prompt has multiple diverse candidate responses.

What happens:

  • Read all output_*.json files from the generation directory
  • Group responses by prompt across all seeds
  • Filter out prompts where all generated responses are identical (no useful preference signal)
  • Produce a combined JSON file (all_outputs.json) with each entry containing the prompt and a list of all generated responses

Step 3: Reward Model Annotation and Binarization

Score each candidate response using a reward model, then convert the multi-response scores into a binary chosen/rejected format suitable for preference optimization training. The reward model evaluates each (prompt, response) pair independently and assigns a scalar quality score.

What happens:

  • Load the reward model (default: RLHFlow/ArmoRM-Llama3-8B-v0.1) for sequence classification
  • For each prompt, score all candidate responses by formatting as chat messages and running reward model inference
  • Save intermediate scored results (all_outputs_rm.json)
  • Binarize: select highest-scoring response as "chosen" and lowest-scoring as "rejected"
  • Format chosen/rejected pairs in OpenAI message format (role/content dicts)
  • Save as both JSON and HuggingFace Dataset format for direct use in SimPO training

Execution Diagram

GitHub URL

Workflow Repository