Implementation:Princeton nlp SimPO VLLM Decode
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Generation, Inference |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
Wrapper documentation for vLLM's LLM and SamplingParams classes as used in SimPO's on-policy response generation script.
Description
The decode script uses vLLM for high-throughput batched inference. It loads a model via vllm.LLM, formats prompts using the model's tokenizer chat template, configures sampling via vllm.SamplingParams, and generates responses in a single batched call via llm.generate(). Results are saved as JSON files named output_{seed}.json. The script is designed to be run multiple times with different --seed values to produce diverse candidate responses.
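The last step described above, writing one record per prompt to output_{seed}.json, can be sketched without vLLM. This is a minimal illustration rather than the script's actual code: the helper name `save_outputs` and its arguments are hypothetical, and only the file name pattern and the three record keys are taken from this page.

```python
import json
import os
from typing import Dict, List


def save_outputs(
    prompts: List[str],
    formatted_prompts: List[str],
    generated_texts: List[str],
    output_dir: str,
    seed: int,
) -> str:
    """Write one record per prompt to <output_dir>/output_<seed>.json.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not part of decode.py.
    """
    if not (len(prompts) == len(formatted_prompts) == len(generated_texts)):
        raise ValueError("all three lists must have the same length")

    # Record keys follow the output schema documented on this page.
    records: List[Dict[str, str]] = [
        {"prompt": p, "format_prompt": f, "generated_text": t}
        for p, f, t in zip(prompts, formatted_prompts, generated_texts)
    ]

    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"output_{seed}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=4, ensure_ascii=False)
    return path
```

In a real run, `generated_texts` would be built from the vLLM results, for example `[o.outputs[0].text for o in outputs]`.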
Usage
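Run the script once per seed value. Because the output file name includes the seed, runs that share an --output_dir do not overwrite each other. For typical SimPO v2 data generation, use 3-5 different seeds.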
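The once-per-seed pattern can also be driven from Python. The following is a sketch under stated assumptions: the helper names `build_decode_command` and `run_all` are hypothetical, the seeds are examples only, and the script path and flags are the ones listed on this page. With `dry_run=True` (the default here) the commands are only built, not executed.

```python
import subprocess
import sys
from typing import List


def build_decode_command(
    seed: int,
    model: str = "google/gemma-2-9b-it",
    output_dir: str = "datasets/gemma2_ultrafeedback",
) -> List[str]:
    """Build the argv list for one decode.py run (hypothetical helper)."""
    return [
        sys.executable,
        "on_policy_data_gen/decode.py",
        "--model", model,
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]


def run_all(seeds: List[int], dry_run: bool = True) -> List[List[str]]:
    """Build one command per seed and, unless dry_run is set, run them in order."""
    commands = [build_decode_command(seed) for seed in seeds]
    if not dry_run:
        for cmd in commands:
            # check=True stops the loop if one run fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_all([42, 43, 44], dry_run=False)` from the repository root would launch the three runs sequentially, each loading the model again.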
Code Reference
Source Location
- Repository: SimPO
- File: on_policy_data_gen/decode.py (Lines 1-61)
Signature
# vLLM API (external library).
# The values shown are the decode script's CLI defaults, not vLLM's own library defaults.
llm = vllm.LLM(model: str)
sampling_params = vllm.SamplingParams(
    temperature: float = 0.8,
    top_p: float = 0.95,
    max_tokens: int = 4096,
    seed: int = 42,
)
outputs = llm.generate(
    prompts: List[str],
    sampling_params: SamplingParams,
) -> List[RequestOutput]
Import
from vllm import LLM, SamplingParams
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | str | No | HuggingFace model ID (default: "google/gemma-2-9b-it") |
| --data_dir | str | No | Dataset name or path (default: "HuggingFaceH4/ultrafeedback_binarized") |
| --temperature | float | No | Sampling temperature (default: 0.8) |
| --top_p | float | No | Nucleus sampling threshold (default: 0.95) |
| --max_tokens | int | No | Maximum generation tokens (default: 4096) |
| --seed | int | No | Random seed for reproducible sampling (default: 42) |
| --output_dir | str | No | Output directory (default: "datasets/gemma2_ultrafeedback") |
Outputs
| Name | Type | Description |
|---|---|---|
| output_{seed}.json | JSON file | List of {"prompt": str, "format_prompt": str, "generated_text": str} objects |
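Since each run produces a separate file with the same record shape, the candidates for a prompt can be gathered by reading the per-seed files and grouping on the "prompt" key. The sketch below assumes only the schema in the table above; the function name `collect_candidates` is hypothetical and this is not the repository's own post-processing code.

```python
import json
import os
from collections import defaultdict
from typing import Dict, List


def collect_candidates(output_dir: str, seeds: List[int]) -> Dict[str, List[str]]:
    """Group generated responses by prompt across output_<seed>.json files.

    Hypothetical helper: responses appear in the order the seeds are given.
    """
    candidates: Dict[str, List[str]] = defaultdict(list)
    for seed in seeds:
        path = os.path.join(output_dir, f"output_{seed}.json")
        with open(path, encoding="utf-8") as fh:
            for record in json.load(fh):
                candidates[record["prompt"]].append(record["generated_text"])
    return dict(candidates)
```

For example, `collect_candidates("datasets/gemma2_ultrafeedback", [42, 43, 44])` would return a dictionary mapping each prompt to its three sampled responses, provided all three files exist.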
Usage Examples
Generate Responses With Multiple Seeds
# Generate with seed 42
python on_policy_data_gen/decode.py \
    --model google/gemma-2-9b-it \
    --data_dir HuggingFaceH4/ultrafeedback_binarized \
    --temperature 0.8 \
    --top_p 0.95 \
    --max_tokens 4096 \
    --seed 42 \
    --output_dir datasets/gemma2_ultrafeedback
# Generate with seed 43 (different random trajectory; other options use their defaults)
python on_policy_data_gen/decode.py \
    --model google/gemma-2-9b-it \
    --seed 43 \
    --output_dir datasets/gemma2_ultrafeedback
# Generate with seed 44
python on_policy_data_gen/decode.py \
    --model google/gemma-2-9b-it \
    --seed 44 \
    --output_dir datasets/gemma2_ultrafeedback
Python Usage (Programmatic)
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # Recommended for Gemma-2

from vllm import LLM, SamplingParams
from datasets import load_dataset

# Load the model and its tokenizer
llm = LLM(model="google/gemma-2-9b-it")
tokenizer = llm.get_tokenizer()

# Load prompts, de-duplicated and sorted for a stable order
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
prompts = sorted(set(dataset["prompt"]))

# Format each prompt as a single-turn chat conversation
conversations = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

# Generate all responses in one batched call
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=4096, seed=42)
outputs = llm.generate(conversations, sampling_params)

# Extract results (outputs are returned in the same order as the input prompts)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")