# Environment: Allenai open-instruct vLLM Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview

This environment provides the vLLM 0.14.1 inference engine required for GRPO generation and asynchronous rollout collection.
## Description

The GRPO training workflow uses vLLM as the inference engine for generating rollouts from the policy model. vLLM runs as a Ray actor (`LLMRayActor`) on dedicated GPUs, using the Flash Attention backend and the V1 API. It supports weight synchronization from the training process via Ray collective communication or NCCL process groups. The engine has configurable GPU memory utilization (default 90%) and prefix caching support.
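As a rough illustration of the actor's role, the sketch below models the interface shape described above. This is a hypothetical stand-in, not the project's `LLMRayActor`: the real actor wraps a `vllm.LLM` instance under Ray on dedicated GPUs, while the stub here stays runnable without GPUs, Ray, or vLLM.

```python
class FakeRolloutActor:
    """Hypothetical stand-in for the vLLM-backed Ray actor (illustration only)."""

    def __init__(self, gpu_memory_utilization: float = 0.9,
                 enable_prefix_caching: bool = True):
        # Mirrors the configurable knobs mentioned in the description.
        self.gpu_memory_utilization = gpu_memory_utilization
        self.enable_prefix_caching = enable_prefix_caching
        self.weight_version = 0

    def generate(self, prompts):
        # Real actor: vLLM continuous-batching generation over the prompts.
        return [f"<rollout for: {p}>" for p in prompts]

    def update_weights(self, version: int):
        # Real actor: receives updated policy weights via Ray collective / NCCL.
        self.weight_version = version


actor = FakeRolloutActor()
rollouts = actor.generate(["prompt A", "prompt B"])
actor.update_weights(1)
```

The training loop alternates between `generate` calls for rollout collection and `update_weights` pushes after each optimizer step; the real transport is described under Compatibility Notes.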
## Usage

Use this environment for GRPO reinforcement learning training. vLLM provides high-throughput generation for rollout collection, which is the primary bottleneck in on-policy RL training. It is not required for SFT, DPO, or reward model training.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (x86_64) | vLLM not supported on macOS; aarch64 support varies by release |
| Hardware | NVIDIA GPU with CUDA | Dedicated GPUs for inference (separate from training GPUs) |
| VRAM | Model-dependent | Must fit model in vLLM's memory pool |
## Dependencies

### Python Packages
- `vllm` == 0.14.1
- `ray[default]` >= 2.49.2
- `flash-attn` >= 2.8.3
- `torch` >= 2.9.0
### Environment Variables
- `VLLM_USE_V1` = 1 (use V1 engine API)
- `VLLM_DISABLE_COMPILE_CACHE` = 1 (avoid stale compile cache)
- `VLLM_ATTENTION_BACKEND` = FLASH_ATTN
- `VLLM_ALLOW_INSECURE_SERIALIZATION` = 1 (for Ray weight transfer)
- `VLLM_LOGGING_LEVEL` = WARNING
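A minimal sketch of setting these variables in Python, using the names and values from the list above. Setting them before `import vllm` is the safe order; whether each individual variable is read at import time is an assumption here, not something the source states.

```python
import os

# Set before importing vllm, in case a setting is read at import time
# (an assumption; values are taken from the Environment Variables list).
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_DISABLE_COMPILE_CACHE"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
os.environ["VLLM_LOGGING_LEVEL"] = "WARNING"
```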
### Credentials
- `HF_TOKEN`: Required if loading gated models from HuggingFace Hub.
## Quick Install

```shell
# vLLM is installed as part of the main project on Linux
uv sync

# Manual install (Linux only); quote the specifiers so the shell
# does not treat ">" as output redirection
pip install vllm==0.14.1 "ray[default]>=2.49.2" "flash-attn>=2.8.3"
```
## Code Evidence

vLLM availability check from `conftest.py:5-10`:

```python
try:
    import vllm  # noqa: F401

    VLLM_AVAILABLE = True
except ImportError:
    VLLM_AVAILABLE = False
```
Timeout constants from `vllm_utils.py:74-75`:

```python
INFERENCE_INIT_TIMEOUT_S = 1200  # 20 minutes for engine initialization
VLLM_HEALTH_CHECK_TIMEOUT_S = 600.0  # 10 minutes for health checks
```
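Timeouts like these are typically applied with a deadline-based polling loop. The sketch below is a hedged illustration of that pattern, not the project's actual API: the `is_healthy` probe, `wait_until_healthy` helper, and injectable clock are all hypothetical.

```python
import time

VLLM_HEALTH_CHECK_TIMEOUT_S = 600.0  # value from vllm_utils.py


def wait_until_healthy(is_healthy, timeout_s=VLLM_HEALTH_CHECK_TIMEOUT_S,
                       poll_interval_s=5.0, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll `is_healthy()` until it returns True or the deadline passes."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_healthy():
            return True
        sleep(poll_interval_s)
    return False


# Usage with a probe that succeeds on its third attempt (interval set to 0
# so the example runs instantly):
attempts = iter([False, False, True])
ok = wait_until_healthy(lambda: next(attempts), poll_interval_s=0.0)
```

Injecting `sleep` and `clock` keeps the loop testable; in production the defaults (`time.sleep`, `time.monotonic`) apply, and `time.monotonic` avoids surprises from wall-clock adjustments.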
GPU memory utilization default from `data_loader.py:285`:

```python
vllm_gpu_memory_utilization: float = 0.9
```
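This fraction bounds the memory pool (model weights plus KV cache) that vLLM claims per GPU. A quick back-of-the-envelope helper, purely illustrative (`vllm_pool_gib` is not a project function):

```python
def vllm_pool_gib(total_vram_gib: float, utilization: float = 0.9) -> float:
    """Approximate size of vLLM's per-GPU memory pool in GiB."""
    return total_vram_gib * utilization


# On an 80 GiB GPU at the 0.9 default, vLLM claims roughly 72 GiB;
# lowering utilization to 0.8 (a common OOM mitigation) claims ~64 GiB,
# leaving more headroom for other allocations on the device.
```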
v0 caching warning from `data_loader.py:290-293`:

```python
if os.environ.get("VLLM_USE_V1") == "0":
    logger.warning("When using the v0 version of vLLM, caching is broken...")
    if self.vllm_enable_prefix_caching:
        raise ValueError("Prefix caching is currently not supported for v0.")
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'vllm'` | vLLM not installed (e.g., on macOS) | Use Linux for GRPO training |
| `CUDA out of memory` in vLLM | GPU memory utilization too high | Reduce `vllm_gpu_memory_utilization` to 0.8 or 0.7 |
| vLLM health check timeout | Large model loading on first init | Increase `VLLM_HEALTH_CHECK_TIMEOUT_S`; allow up to 20 minutes for initialization |
| Prefix caching error with v0 | Using vLLM v0 API with prefix caching | Set `VLLM_USE_V1=1` or disable prefix caching |
## Compatibility Notes
- macOS: vLLM is completely excluded via platform marker. GRPO training is not possible on macOS.
- ARM Linux (aarch64): vLLM support depends on the vLLM release; check vLLM docs.
- vLLM V1 vs V0: V1 is the default. V0 has known caching bugs and lacks prefix caching support.
- Weight sync: Uses Ray collective or NCCL process groups for transferring updated weights from training to inference.