Implementation:Princeton nlp SimPO VLLM Decode

Knowledge Sources	SimPO vLLM
Domains	NLP, Data_Generation, Inference
Last Updated	2026-02-08 04:30 GMT

Overview

Wrapper documentation for vLLM's LLM and SamplingParams classes as used in SimPO's on-policy response generation script.

Description

The decode script uses vLLM for high-throughput batched inference. It loads a model via vllm.LLM, formats prompts using the model's tokenizer chat template, configures sampling via vllm.SamplingParams, and generates responses in a single batched call via llm.generate(). Results are saved as JSON files named output_{seed}.json. The script is designed to be run multiple times with different --seed values to produce diverse candidate responses.

Usage

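Run this script once per seed value; each run writes one response per prompt to output_{seed}.json. For typical SimPO v2 data generation, run with 3-5 different seeds so that every prompt ends up with several candidate responses.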

Code Reference

Source Location

Repository: SimPO
File: on_policy_data_gen/decode.py (Lines 1-61)

Signature

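# vLLM API (external library), shown in signature form rather than as runnable code.
# The default values below are the ones decode.py passes in, not vLLM's own defaults.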
llm = vllm.LLM(model: str)

sampling_params = vllm.SamplingParams(
    temperature: float = 0.8,
    top_p: float = 0.95,
    max_tokens: int = 4096,
    seed: int = 42,
)

outputs = llm.generate(
    prompts: List[str],
    sampling_params: SamplingParams,
) -> List[RequestOutput]

Import

from vllm import LLM, SamplingParams

I/O Contract

Inputs

Name	Type	Required	Description
--model	str	No	HuggingFace model ID (default: "google/gemma-2-9b-it")
--data_dir	str	No	Dataset name or path (default: "HuggingFaceH4/ultrafeedback_binarized")
--temperature	float	No	Sampling temperature (default: 0.8)
--top_p	float	No	Nucleus sampling threshold (default: 0.95)
--max_tokens	int	No	Maximum generation tokens (default: 4096)
--seed	int	No	Random seed for reproducible sampling (default: 42)
--output_dir	str	No	Output directory (default: "datasets/gemma2_ultrafeedback")
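
The table above maps directly onto a command-line parser. The following is a minimal sketch reconstructed from the table, assuming the script uses argparse; the function name build_parser is hypothetical and the code is not copied from decode.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser reconstructed from the Inputs table (an assumption, not the original source)."""
    parser = argparse.ArgumentParser(description="On-policy decoding with vLLM")
    parser.add_argument("--model", type=str, default="google/gemma-2-9b-it")
    parser.add_argument("--data_dir", type=str, default="HuggingFaceH4/ultrafeedback_binarized")
    parser.add_argument("--temperature", type=float, default=0.8)
    parser.add_argument("--top_p", type=float, default=0.95)
    parser.add_argument("--max_tokens", type=int, default=4096)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output_dir", type=str, default="datasets/gemma2_ultrafeedback")
    return parser


# Every flag is optional, so overriding only the seed leaves the other defaults intact.
args = build_parser().parse_args(["--seed", "43"])
print(args.seed, args.temperature, args.output_dir)
```

Because all arguments have defaults, a run that changes only --seed reuses the same model, dataset, and sampling settings, which is what makes the per-seed outputs comparable.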

Outputs

Name	Type	Description
output_{seed}.json	JSON file	List of {"prompt": str, "format_prompt": str, "generated_text": str} objects
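
To make the output contract concrete, here is a small sketch that loads one output file and checks that every record carries the three documented keys. The helper name load_outputs is hypothetical; only the file layout (a JSON list of objects with prompt, format_prompt, and generated_text) comes from the table above.

```python
import json
from typing import Dict, List

# Keys documented in the Outputs table.
REQUIRED_KEYS = {"prompt", "format_prompt", "generated_text"}


def load_outputs(path: str) -> List[Dict[str, str]]:
    """Load an output_{seed}.json file and verify each record has the documented keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a JSON list of records")
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"{path}: record {index} is missing keys {sorted(missing)}")
    return records
```

A check like this is cheap to run before downstream steps consume the file, since a partially written or hand-edited file would otherwise fail much later.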

Usage Examples

Generate Responses With Multiple Seeds

# Generate with seed 42
python on_policy_data_gen/decode.py \
    --model google/gemma-2-9b-it \
    --data_dir HuggingFaceH4/ultrafeedback_binarized \
    --temperature 0.8 \
    --top_p 0.95 \
    --max_tokens 4096 \
    --seed 42 \
    --output_dir datasets/gemma2_ultrafeedback

# Generate with seed 43 (different random trajectory)
python on_policy_data_gen/decode.py \
    --model google/gemma-2-9b-it \
    --seed 43 \
    --output_dir datasets/gemma2_ultrafeedback

# Generate with seed 44
python on_policy_data_gen/decode.py \
    --model google/gemma-2-9b-it \
    --seed 44 \
    --output_dir datasets/gemma2_ultrafeedback
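
After several seeded runs, the per-seed files can be grouped by prompt to obtain the candidate responses for each prompt. The sketch below is a hypothetical helper, not part of the documented script; it assumes only the output_{seed}.json naming and record keys described in the I/O contract.

```python
import json
import os
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_seed_outputs(output_dir: str, seeds: Iterable[int]) -> Dict[str, List[str]]:
    """Group generated responses by prompt across several output_{seed}.json files."""
    candidates: Dict[str, List[str]] = defaultdict(list)
    for seed in seeds:
        path = os.path.join(output_dir, f"output_{seed}.json")
        with open(path, encoding="utf-8") as f:
            for record in json.load(f):
                # Responses are appended in seed order, one per run.
                candidates[record["prompt"]].append(record["generated_text"])
    return dict(candidates)
```

For the three runs above, merge_seed_outputs("datasets/gemma2_ultrafeedback", [42, 43, 44]) would return one list of three responses per prompt, provided all three files cover the same prompts.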

Python Usage (Programmatic)

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # Recommended for Gemma-2

from vllm import LLM, SamplingParams
from datasets import load_dataset

# Load model
llm = LLM(model="google/gemma-2-9b-it")
tokenizer = llm.get_tokenizer()

# Load prompts
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
prompts = sorted(list(set(dataset["prompt"])))

# Format as chat conversations
conversations = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

# Generate
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=4096, seed=42)
outputs = llm.generate(conversations, sampling_params)

# Extract results
for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i][:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")
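
The example above only prints the responses. A sketch of the final step, turning the generation results into the documented record format and writing output_{seed}.json, might look like the following. The helper names collect_records and write_records are hypothetical, and the code assumes that llm.generate() returns results in the same order as the input prompts and that each result exposes its text as output.outputs[0].text, as in the loop above.

```python
import json
import os
from typing import Any, Dict, List, Sequence


def collect_records(
    prompts: Sequence[str],
    conversations: Sequence[str],
    outputs: Sequence[Any],
) -> List[Dict[str, str]]:
    """Pair each raw prompt and formatted prompt with its first generated completion."""
    if not (len(prompts) == len(conversations) == len(outputs)):
        raise ValueError("prompts, conversations, and outputs must have the same length")
    return [
        {
            "prompt": prompt,
            "format_prompt": conversation,
            "generated_text": output.outputs[0].text,
        }
        for prompt, conversation, output in zip(prompts, conversations, outputs)
    ]


def write_records(records: List[Dict[str, str]], output_dir: str, seed: int) -> str:
    """Write the records to output_{seed}.json inside output_dir and return the path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"output_{seed}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)
    return path
```

With the variables from the example, write_records(collect_records(prompts, conversations, outputs), "datasets/gemma2_ultrafeedback", 42) would produce the file described in the Outputs table.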
