Environment:Princeton nlp SimPO VLLM Inference
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, NLP, Inference |
| Last Updated | 2026-02-08 05:00 GMT |
Overview
Linux environment with vLLM, FlashInfer (for Gemma-2), and CUDA GPU for high-throughput batched inference during on-policy response generation.
Description
This environment provides the inference context for generating on-policy responses using vLLM's batched generation engine. It requires a CUDA-capable GPU with enough VRAM to hold the target model (e.g., Gemma-2 9B or Llama-3 8B). For Gemma-2 models, the FlashInfer attention backend must be installed and activated via the `VLLM_ATTENTION_BACKEND` environment variable. The script generates responses from prompts using temperature-based sampling and saves results as JSON.
Usage
Use this environment for the On-Policy Data Generation workflow, specifically when running the VLLM_Decode implementation to generate candidate responses with multiple seeds.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | vLLM requires Linux (no Windows/macOS support) |
| Hardware | NVIDIA GPU | Sufficient VRAM to hold the model (e.g., ~20GB for 9B model in float16) |
| Disk | 50GB+ SSD | Model weights download and output storage |
Dependencies
System Packages
- CUDA toolkit (compatible with installed PyTorch version)
Python Packages
Core:
- `vllm` (latest compatible version)
- `torch` >= 2.0
- `transformers` (for tokenizer)
- `datasets` (for loading HuggingFace datasets)
Optional (for Gemma-2 models):
- `flashinfer` (required when using `VLLM_ATTENTION_BACKEND=FLASHINFER`)
Credentials
The following environment variables may be needed:
- `HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Gemma-2)
- `VLLM_ATTENTION_BACKEND`: Set to `FLASHINFER` for Gemma-2 models (must be set before importing vllm)
Quick Install
# Install vLLM (includes torch and CUDA dependencies)
pip install vllm
# Install datasets for loading prompts
pip install datasets
# For Gemma-2 models, install FlashInfer backend
pip install flashinfer
Code Evidence
FlashInfer backend activation from `on_policy_data_gen/decode.py:4`:
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER" # this is recommended for gemma-2 models; otherwise it is not needed
vLLM model loading from `on_policy_data_gen/decode.py:28`:
llm = LLM(model=args.model)
tokenizer = llm.get_tokenizer()
Sampling configuration from `on_policy_data_gen/decode.py:37-40`:
sampling_params = SamplingParams(temperature=args.temperature,
top_p=args.top_p,
max_tokens=args.max_tokens,
seed=args.seed,)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'vllm'` | vLLM not installed | `pip install vllm` |
| `RuntimeError: FlashInfer not found` | FlashInfer backend missing for Gemma-2 | `pip install flashinfer` and set `VLLM_ATTENTION_BACKEND=FLASHINFER` before import |
| `CUDA out of memory` | Model too large for available VRAM | Use a smaller model or enable tensor parallelism with vLLM |
| vLLM generates empty or garbled output | Wrong chat template applied | Ensure the model's tokenizer has a valid chat_template; check with `tokenizer.chat_template` |
Compatibility Notes
- Gemma-2 Models: Require `VLLM_ATTENTION_BACKEND=FLASHINFER` set as an environment variable before importing vllm. Other model families (Llama, Mistral) do not need this.
- Linux Only: vLLM officially supports Linux only. macOS and Windows are not supported.
- GPU Memory: A single GPU must hold the entire model for the default decode.py script. For larger models, configure vLLM tensor parallelism.