Heuristic:Princeton nlp SimPO Multi Seed Diversity

Knowledge Sources	SimPO On-Policy Data Gen README
Domains	Data_Generation, LLMs, Preference_Optimization
Last Updated	2026-02-08 05:00 GMT

Overview

Run vLLM decoding once per random seed (5 seeds by default) at temperature 0.8 so that each prompt receives several distinct candidate responses for on-policy preference data.

Description

For on-policy data generation, the SimPO pipeline produces multiple response candidates per prompt by running the decode script once per seed value. Each run yields one response per prompt under that seed. The per-seed outputs are then combined, prompts whose responses are identical across all seeds are filtered out, and the remaining candidates are scored by a reward model to create preference pairs. Using multiple seeds with a moderate temperature (0.8) and top_p (0.95) encourages diversity in the candidate pool while keeping responses coherent. The default seeds are 13, 21, 42, 79, and 100.
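The combining step described above can be sketched as follows. This is a minimal illustration, not the repository's post_process.py: the function name `combine_seed_outputs` and the record fields `prompt` and `response` are hypothetical, and it assumes every run lists its outputs in the same prompt order.

```python
from typing import Dict, List


def combine_seed_outputs(runs: Dict[int, List[dict]]) -> List[dict]:
    """Merge per-seed outputs into one record per prompt.

    `runs` maps a seed to that run's outputs (hypothetical record layout:
    {"prompt": ..., "response": ...}), listed in the same prompt order.
    """
    seeds = sorted(runs)
    first = runs[seeds[0]]
    reference_prompts = [record["prompt"] for record in first]

    # Every run must cover the same prompts in the same order.
    for seed in seeds[1:]:
        if [record["prompt"] for record in runs[seed]] != reference_prompts:
            raise ValueError(f"seed {seed} does not align with seed {seeds[0]}")

    combined = []
    for i, prompt in enumerate(reference_prompts):
        combined.append({
            "prompt": prompt,
            # One candidate per seed, in ascending seed order.
            "responses": [runs[seed][i]["response"] for seed in seeds],
        })
    return combined


# Toy data standing in for two decode runs (seeds 13 and 21).
runs = {
    13: [{"prompt": "Name a prime.", "response": "2"},
         {"prompt": "Say hi.", "response": "Hi!"}],
    21: [{"prompt": "Name a prime.", "response": "7"},
         {"prompt": "Say hi.", "response": "Hi!"}],
}
combined = combine_seed_outputs(runs)
```

Aligning runs by position keeps the sketch short; a real pipeline might instead key records by a prompt identifier if run order is not guaranteed.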

Usage

Use this heuristic when generating on-policy preference data with the VLLM_Decode implementation. Multi-seed generation must be complete before running the post-processing and reward annotation steps.

The Insight (Rule of Thumb)

Action: Run decode.py once per seed with at least 5 different seed values; the README defaults are 13, 21, 42, 79, and 100.
Action: Use temperature=0.8 and top_p=0.95 to balance diversity and coherence.
Action: Set max_tokens=4096 to allow full-length responses.
Action: After generation, run post_process.py to combine the outputs and drop prompts whose responses are identical across all seeds.
Value: 5 candidate responses per prompt (one per seed).
Trade-off: 5x inference cost compared to single-response generation. More seeds increase diversity but also increase compute time linearly.
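The actions above amount to one decode invocation per seed. The sketch below only builds the command lines rather than running them. It rests on assumptions: `--temperature` appears in the code evidence on this page, while `--seed`, `--top_p`, and `--max_tokens` are inferred from the `args.*` attribute names and may not match the script's actual flags. Any other arguments the script needs (for example model or output location) are omitted, and `build_decode_commands` is a hypothetical helper.

```python
import shlex
from typing import List, Sequence

# Default seeds quoted from the README.
DEFAULT_SEEDS = (13, 21, 42, 79, 100)


def build_decode_commands(seeds: Sequence[int] = DEFAULT_SEEDS,
                          temperature: float = 0.8,
                          top_p: float = 0.95,
                          max_tokens: int = 4096) -> List[List[str]]:
    """Return one argument list per seed; sampling settings are held constant."""
    commands = []
    for seed in seeds:
        commands.append([
            "python", "on_policy_data_gen/decode.py",
            "--seed", str(seed),              # assumed flag name
            "--temperature", str(temperature),
            "--top_p", str(top_p),            # assumed flag name
            "--max_tokens", str(max_tokens),  # assumed flag name
        ])
    return commands


commands = build_decode_commands()
for cmd in commands:
    # Print a shell-ready line for inspection; nothing is executed here.
    print(shlex.join(cmd))
```

Because only the seed changes between runs, total inference cost grows linearly with the number of seeds, which is the trade-off noted above.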

Reasoning

Preference optimization requires pairs of responses where one is clearly better than the other. Generating multiple candidates from the same model (on-policy) and scoring them with a strong reward model (e.g., ArmoRM) produces high-quality preference pairs that match the model's current distribution. Temperature 0.8 provides enough randomness for diverse outputs without degenerating into incoherent text. The post-processing step filters out prompts where all 5 seeds produced identical responses (indicating low diversity for that prompt), ensuring the training data has meaningful preference signal.
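To make the filtering and pairing logic concrete, here is a minimal sketch under stated assumptions. The scores are plain floats standing in for reward-model output, and choosing the highest-scored response as "chosen" and the lowest-scored as "rejected" is an assumed pairing rule that this page does not spell out. The function names are hypothetical.

```python
from typing import List, Optional


def has_preference_signal(responses: List[str]) -> bool:
    """True when at least two candidates differ, so a pair can be formed."""
    return len(set(responses)) > 1


def make_preference_pair(prompt: str,
                         responses: List[str],
                         scores: List[float]) -> Optional[dict]:
    """Build a chosen/rejected pair, or return None if the prompt is unusable."""
    if len(responses) != len(scores):
        raise ValueError("need exactly one score per response")
    # Skip prompts where every seed produced the same text.
    if not has_preference_signal(responses):
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    # Tied scores give no ordering between candidates, so skip those too.
    if scores[best] == scores[worst]:
        return None
    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


# Toy candidates and scores for illustration only.
pair = make_preference_pair("Name a prime.", ["2", "7", "2"], [0.9, 0.4, 0.9])
skipped = make_preference_pair("Say hi.", ["Hi!", "Hi!", "Hi!"], [0.5, 0.5, 0.5])
```

In this toy run the first prompt yields a pair, while the second is dropped because all candidates are identical.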

Code evidence from `on_policy_data_gen/decode.py:13-14`:

parser.add_argument('--temperature', type=float, default=0.8,
                    help='Temperature for sampling')

Code evidence from `on_policy_data_gen/decode.py:37-40`:

sampling_params = SamplingParams(temperature=args.temperature,
                                 top_p=args.top_p,
                                 max_tokens=args.max_tokens,
                                 seed=args.seed,)

From `on_policy_data_gen/README.md`:

"Note that you will need to run the above command under multiple different seeds (by default, we use 13, 21, 42, 79, 100) to obtain different responses for each prompt."
