Principle:Princeton nlp SimPO Multi Seed Response Generation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Generation, Inference |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
A batched inference technique that samples several diverse responses to each prompt under different random seeds, for later use in constructing preference pairs.
Description
On-policy data generation creates preference training data from the model's own outputs rather than using a static external dataset. The multi-seed response generation step produces multiple candidate responses per prompt by running inference with different random seeds. Each seed produces a different sample from the model's output distribution for the same prompt. These diverse candidates are later scored by a reward model to identify the best and worst responses, forming chosen/rejected pairs. vLLM is used as the inference engine for efficient batched generation with PagedAttention.
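The scoring-and-pairing step described above can be illustrated with a minimal sketch. It assumes each prompt already has a list of candidate responses (one per seed) and a matching list of reward scores. The function name `build_preference_pair`, the output keys, and the scores are hypothetical and are not taken from the SimPO codebase.

```python
from typing import Dict, List


def build_preference_pair(
    prompt: str, responses: List[str], scores: List[float]
) -> Dict[str, str]:
    """Pick the highest- and lowest-scored responses as chosen/rejected.

    Hypothetical helper: the name and output keys are illustrative only.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need at least two responses with matching scores")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


# Three candidates for one prompt, e.g. from seeds 42, 43, 44.
pair = build_preference_pair(
    "Explain beam search.",
    ["answer from seed 42", "answer from seed 43", "answer from seed 44"],
    [0.31, 0.87, 0.12],  # made-up reward scores for illustration
)
print(pair["chosen"])
print(pair["rejected"])
```

If every candidate receives the same score, the chosen and rejected responses coincide in this sketch, so a real pipeline would likely want to filter such pairs out.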
Usage
Use this principle when creating on-policy preference data for SimPO v2 training. This is the first step of the three-step data generation pipeline. Run the generation script multiple times with different --seed values (e.g., 42, 43, 44) to produce diverse response sets.
Theoretical Basis
Multi-seed generation leverages stochastic sampling to explore the model's output distribution:
- Temperature sampling: Scales the logits before the softmax, which controls the entropy of the output distribution (higher temperature = more diverse)
- Nucleus (top-p) sampling: Restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches p
- Seed variation: Different random seeds produce different sampled sequences from the same distribution
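The three mechanisms above can be sketched on a toy next-token distribution using only the Python standard library. This is an illustration under simplifying assumptions, not the sampler used by any inference engine: the logits are made up, the same logits are reused at every step (a real model conditions on previous tokens), and the helper names `sample_token` and `sample_sequence` are hypothetical.

```python
import math
import random
from typing import List


def sample_token(
    logits: List[float], temperature: float, top_p: float, rng: random.Random
) -> int:
    """Sample one token index with temperature scaling and nucleus filtering."""
    # Temperature: divide logits before the softmax; higher values flatten the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample within the nucleus; random.choices normalizes the weights itself.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


def sample_sequence(
    logits: List[float],
    seed: int,
    length: int = 8,
    temperature: float = 0.8,
    top_p: float = 0.95,
) -> List[int]:
    """Draw a short token sequence; the seed fixes the random trajectory."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, top_p, rng) for _ in range(length)]


toy_logits = [2.0, 1.5, 0.5, -1.0, -3.0]  # made-up next-token logits
for seed in (42, 43, 44):
    print(seed, sample_sequence(toy_logits, seed))
```

Re-running with the same seed reproduces the same sequence, while the two least likely tokens fall outside the nucleus at these settings and are never sampled.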
Pseudo-code:
# Abstract algorithm (NOT real implementation)
for seed in [42, 43, 44, ...]:
    set_random_seed(seed)
    for prompt in dataset:
        response = model.generate(prompt, temperature=0.8, top_p=0.95)
        save(prompt, response, seed)
The diversity across seeds gives the reward model meaningfully different candidates to compare, which reduces the chance of trivial preference pairs in which the chosen and rejected responses are nearly identical.
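A more concrete version of the per-seed loop might look like the sketch below. It is not the repository's generation script: the model name is a placeholder, `max_tokens=1024` and the output file naming are assumptions, and `run_seed`, `vllm_generate`, and `fake_generate` are hypothetical helpers. The vLLM import is kept inside `vllm_generate` so that the rest of the example runs without a GPU, using a stand-in generator.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def vllm_generate(prompts: List[str], seed: int) -> List[str]:
    """Generate one response per prompt with vLLM (requires vLLM and a GPU)."""
    from vllm import LLM, SamplingParams  # imported lazily; heavy dependency

    llm = LLM(model="your-org/your-model", seed=seed)  # placeholder model name
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024)
    outputs = llm.generate(prompts, params)  # one batched call for all prompts
    return [out.outputs[0].text for out in outputs]


def run_seed(
    prompts: List[str],
    seed: int,
    generate_fn: Callable[[List[str], int], List[str]],
    out_dir: str,
) -> List[Dict]:
    """Generate responses for one seed and save them alongside the seed value."""
    responses = generate_fn(prompts, seed)
    records = [
        {"prompt": p, "response": r, "seed": seed}
        for p, r in zip(prompts, responses)
    ]
    out_path = os.path.join(out_dir, f"responses_seed{seed}.json")  # hypothetical name
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return records


def fake_generate(prompts: List[str], seed: int) -> List[str]:
    """Stand-in generator so the sketch runs without a GPU."""
    return [f"[seed {seed}] reply to: {p}" for p in prompts]


prompts = ["What is nucleus sampling?", "Define entropy."]
with tempfile.TemporaryDirectory() as out_dir:
    all_records = [run_seed(prompts, s, fake_generate, out_dir) for s in (42, 43, 44)]
    saved_files = sorted(os.listdir(out_dir))
print(saved_files)
```

Passing `vllm_generate` instead of `fake_generate` would route the same loop through vLLM. Writing one file per seed keeps the runs independent, so each can be launched separately and the results merged before reward scoring.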