Environment:Princeton nlp SimPO VLLM Inference

Knowledge Sources	SimPO vLLM
Domains	Infrastructure, NLP, Inference
Last Updated	2026-02-08 05:00 GMT

Overview

Linux environment with vLLM, FlashInfer (for Gemma-2), and CUDA GPU for high-throughput batched inference during on-policy response generation.

Description

This environment provides the inference context for generating on-policy responses using vLLM's batched generation engine. It requires a CUDA-capable GPU with enough VRAM to hold the target model (e.g., Gemma-2 9B or Llama-3 8B). For Gemma-2 models, the FlashInfer attention backend must be installed and activated via the `VLLM_ATTENTION_BACKEND` environment variable. The script generates responses from prompts using temperature-based sampling and saves results as JSON.

Usage

Use this environment for the On-Policy Data Generation workflow, specifically when running the VLLM_Decode implementation to generate candidate responses with multiple seeds.

System Requirements

Category	Requirement	Notes
OS	Linux	vLLM requires Linux (no Windows/macOS support)
Hardware	NVIDIA GPU	Sufficient VRAM to hold the model (e.g., ~20GB for 9B model in float16)
Disk	50GB+ SSD	Model weights download and output storage

Dependencies

System Packages

CUDA toolkit (compatible with installed PyTorch version)

Python Packages

Core:

`vllm` (latest compatible version)
`torch` >= 2.0
`transformers` (for tokenizer)
`datasets` (for loading HuggingFace datasets)

Optional (for Gemma-2 models):

`flashinfer` (required when using `VLLM_ATTENTION_BACKEND=FLASHINFER`)

Credentials

The following environment variables may be needed:

`HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Gemma-2)
`VLLM_ATTENTION_BACKEND`: Set to `FLASHINFER` for Gemma-2 models (must be set before importing vllm)

Quick Install

# Install vLLM (includes torch and CUDA dependencies)
pip install vllm

# Install datasets for loading prompts
pip install datasets

# For Gemma-2 models, install FlashInfer backend
pip install flashinfer

Code Evidence

FlashInfer backend activation from `on_policy_data_gen/decode.py:4`:

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER" # this is recommended for gemma-2 models; otherwise it is not needed

vLLM model loading from `on_policy_data_gen/decode.py:28`:

llm = LLM(model=args.model)
tokenizer = llm.get_tokenizer()

Sampling configuration from `on_policy_data_gen/decode.py:37-40`:

sampling_params = SamplingParams(temperature=args.temperature,
                                 top_p=args.top_p,
                                 max_tokens=args.max_tokens,
                                 seed=args.seed,)

Common Errors

Error Message	Cause	Solution
`ImportError: No module named 'vllm'`	vLLM not installed	`pip install vllm`
`RuntimeError: FlashInfer not found`	FlashInfer backend missing for Gemma-2	`pip install flashinfer` and set `VLLM_ATTENTION_BACKEND=FLASHINFER` before import
`CUDA out of memory`	Model too large for available VRAM	Use a smaller model or enable tensor parallelism with vLLM
vLLM generates empty or garbled output	Wrong chat template applied	Ensure the model's tokenizer has a valid chat_template; check with `tokenizer.chat_template`

Compatibility Notes

Gemma-2 Models: Require `VLLM_ATTENTION_BACKEND=FLASHINFER` set as an environment variable before importing vllm. Other model families (Llama, Mistral) do not need this.
Linux Only: vLLM officially supports Linux only. macOS and Windows are not supported.
GPU Memory: A single GPU must hold the entire model for the default decode.py script. For larger models, configure vLLM tensor parallelism.

Related Pages

Implementation:Princeton_nlp_SimPO_VLLM_Decode

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment