Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Environment:Princeton nlp SimPO VLLM Inference

From Leeroopedia
Revision as of 18:32, 16 February 2026 by Admin (talk | contribs) (Auto-imported from environments/Princeton_nlp_SimPO_VLLM_Inference.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Infrastructure, NLP, Inference
Last Updated 2026-02-08 05:00 GMT

Overview

Linux environment with vLLM, FlashInfer (for Gemma-2), and CUDA GPU for high-throughput batched inference during on-policy response generation.

Description

This environment provides the inference context for generating on-policy responses using vLLM's batched generation engine. It requires a CUDA-capable GPU with enough VRAM to hold the target model (e.g., Gemma-2 9B or Llama-3 8B). For Gemma-2 models, the FlashInfer attention backend must be installed and activated via the `VLLM_ATTENTION_BACKEND` environment variable. The script generates responses from prompts using temperature-based sampling and saves results as JSON.

Usage

Use this environment for the On-Policy Data Generation workflow, specifically when running the VLLM_Decode implementation to generate candidate responses with multiple seeds.

System Requirements

Category Requirement Notes
OS Linux vLLM requires Linux (no Windows/macOS support)
Hardware NVIDIA GPU Sufficient VRAM to hold the model (e.g., ~20GB for 9B model in float16)
Disk 50GB+ SSD Model weights download and output storage

Dependencies

System Packages

  • CUDA toolkit (compatible with installed PyTorch version)

Python Packages

Core:

  • `vllm` (latest compatible version)
  • `torch` >= 2.0
  • `transformers` (for tokenizer)
  • `datasets` (for loading HuggingFace datasets)

Optional (for Gemma-2 models):

  • `flashinfer` (required when using `VLLM_ATTENTION_BACKEND=FLASHINFER`)

Credentials

The following environment variables may be needed:

  • `HF_TOKEN`: HuggingFace API token for downloading gated models (e.g., Gemma-2)
  • `VLLM_ATTENTION_BACKEND`: Set to `FLASHINFER` for Gemma-2 models (must be set before importing vllm)

Quick Install

# Install vLLM (includes torch and CUDA dependencies)
pip install vllm

# Install datasets for loading prompts
pip install datasets

# For Gemma-2 models, install FlashInfer backend
pip install flashinfer

Code Evidence

FlashInfer backend activation from `on_policy_data_gen/decode.py:4`:

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER" # this is recommended for gemma-2 models; otherwise it is not needed

vLLM model loading from `on_policy_data_gen/decode.py:28`:

llm = LLM(model=args.model)
tokenizer = llm.get_tokenizer()

Sampling configuration from `on_policy_data_gen/decode.py:37-40`:

sampling_params = SamplingParams(temperature=args.temperature,
                                 top_p=args.top_p,
                                 max_tokens=args.max_tokens,
                                 seed=args.seed,)

Common Errors

Error Message Cause Solution
`ImportError: No module named 'vllm'` vLLM not installed `pip install vllm`
`RuntimeError: FlashInfer not found` FlashInfer backend missing for Gemma-2 `pip install flashinfer` and set `VLLM_ATTENTION_BACKEND=FLASHINFER` before import
`CUDA out of memory` Model too large for available VRAM Use a smaller model or enable tensor parallelism with vLLM
vLLM generates empty or garbled output Wrong chat template applied Ensure the model's tokenizer has a valid chat_template; check with `tokenizer.chat_template`

Compatibility Notes

  • Gemma-2 Models: Require `VLLM_ATTENTION_BACKEND=FLASHINFER` set as an environment variable before importing vllm. Other model families (Llama, Mistral) do not need this.
  • Linux Only: vLLM officially supports Linux only. macOS and Windows are not supported.
  • GPU Memory: A single GPU must hold the entire model for the default decode.py script. For larger models, configure vLLM tensor parallelism.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment