Environment: Alibaba ROLL vLLM Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
vLLM inference backend environment supporting versions 0.8.4, 0.10.2, 0.11.0, 0.11.1, and 0.12.0 with version-specific patches for ROLL integration.
Description
This environment provides the vLLM high-throughput inference backend for ROLL. The framework applies version-specific monkey patches to customize Ray distributed execution, sleep mode, and worker management. Only explicitly tested vLLM versions are supported; untested versions produce a warning and may exhibit unexpected behavior. Key configuration includes disabling PyTorch expandable memory segments (`expandable_segments:False`), setting worker multiprocess method to `spawn`, and configuring per-worker cache directories for both vLLM and FlashInfer.
Usage
Use this environment when configuring actor_infer workers with the vLLM backend for high-throughput LLM text generation during rollout in RLVR, Agentic, and other pipelines that require response generation.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA or AMD GPU | See CUDA/ROCm environment pages |
| VRAM | Controlled by `gpu_memory_utilization` | Default 0.8 (80% of GPU memory) |
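The effect of `gpu_memory_utilization` is simple proportional capping. A minimal sketch of the arithmetic, assuming a hypothetical 80 GB GPU (the figure is illustrative, not queried from hardware):

```python
# Illustrative arithmetic only: 80 GB is a hypothetical GPU size.
total_vram_gb = 80
gpu_memory_utilization = 0.8  # default from the table above

vllm_budget_gb = total_vram_gb * gpu_memory_utilization
print(f"vLLM may reserve up to {vllm_budget_gb:.0f} GB for weights + KV cache")
# The remaining ~20% is left for other processes and CUDA overhead.
```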
Dependencies
Python Packages
- `vllm` == 0.8.4 (torch 2.6.0) or 0.10.2 (torch 2.8.0) or 0.11.0 or 0.12.0
- `flashinfer` (installed as vLLM dependency)
- `packaging` (for version comparison)
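ROLL compares versions for exact equality rather than with range specifiers. A minimal sketch of that gating, using the `packaging` dependency listed above (`check_vllm_version` and `TESTED_VERSIONS` are illustrative names, not ROLL code):

```python
# Sketch (not ROLL code) of exact-version gating with packaging.
from packaging.version import Version

TESTED_VERSIONS = ("0.8.4", "0.10.2", "0.11.0", "0.12.0")

def check_vllm_version(installed: str) -> bool:
    """True only if `installed` exactly matches a ROLL-tested vLLM release."""
    return any(Version(installed) == Version(v) for v in TESTED_VERSIONS)

print(check_vllm_version("0.10.2"))  # True
print(check_vllm_version("0.9.1"))   # False: ROLL would log a warning instead
```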
Environment Variables
- `VLLM_USE_V1`: Controls vLLM engine version (default `1`; set to `0` on ROCm)
- `VLLM_PORT`: vLLM server port (set internally)
- `VLLM_CACHE_ROOT`: Cache directory (set internally per worker)
- `VLLM_WORKER_MULTIPROC_METHOD`: Must be `spawn` (set internally)
- `FLASHINFER_WORKSPACE_BASE`: FlashInfer workspace directory (set internally per worker)
- `VLLM_ALLREDUCE_USE_SYMM_MEM`: Set to `0` to work around a vLLM 0.11.0 bug
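The variables above are set before vLLM is imported in each worker process. A hypothetical sketch of that setup (`configure_vllm_env` and the `/tmp` paths are illustrative, not ROLL's actual paths):

```python
# Hypothetical sketch of ROLL's per-worker environment setup; the function
# name and /tmp paths are illustrative.
import os

def configure_vllm_env(worker_rank: int, rocm: bool = False) -> None:
    # spawn is required: fork would inherit CUDA state into vLLM workers
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    # per-worker cache dirs keep colocated workers from clobbering each other
    os.environ["VLLM_CACHE_ROOT"] = f"/tmp/vllm_cache/worker_{worker_rank}"
    os.environ["FLASHINFER_WORKSPACE_BASE"] = f"/tmp/flashinfer/worker_{worker_rank}"
    if rocm:
        os.environ["VLLM_USE_V1"] = "0"  # V1 engine is disabled on ROCm

configure_vllm_env(worker_rank=0)
```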
Quick Install
```shell
# For a torch 2.6.0 setup
pip install vllm==0.8.4

# For a torch 2.8.0 setup
pip install vllm==0.10.2

# Or use the pinned requirements file matching your torch version
pip install -r requirements_torch260_vllm.txt   # torch 2.6.0
pip install -r requirements_torch280_vllm.txt   # torch 2.8.0
```
Code Evidence
Version-specific patching from `roll/third_party/vllm/__init__.py:20-36`:
```python
if Version("0.8.4") == Version(vllm.__version__):
    import roll.third_party.vllm.vllm_0_8_4  # apply patch
elif Version("0.10.2") == Version(vllm.__version__):
    ray_executor_class_v0 = safe_import_class(...)
elif Version("0.11.0") == Version(vllm.__version__):
    ray_executor_class_v0 = safe_import_class(...)
elif Version("0.12.0") == Version(vllm.__version__):
    ray_executor_class_v0 = None  # V0 deprecated
else:
    logger.warning(f"ROLL is not tested on vllm version {vllm.__version__}")
```
Memory allocator override from `roll/third_party/vllm/__init__.py:53-56`:
```python
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ""
torch.cuda.memory._set_allocator_settings("expandable_segments:False")
```
Spawn method requirement from `roll/third_party/vllm/__init__.py:65`:
```python
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ROLL is not tested on vllm version X` | Unsupported vLLM version | Install one of: 0.8.4, 0.10.2, 0.11.0, 0.12.0 |
| `ROLL does not support using ray distributed executor` | Version mismatch between vLLM and ROLL patches | Match vLLM version to ROLL's tested versions |
| `CUDA out of memory` during inference | GPU memory utilization too high | Reduce `gpu_memory_utilization` (default 0.8) or use `sleep_level: 2` |
Compatibility Notes
- vLLM 0.12.0: V0 engine deprecated; only V1 executor available.
- ROCm: vLLM V1 disabled (`VLLM_USE_V1=0`).
- Ascend NPU: Uses `vllm-ascend` package with NPUWorker class.
- Sleep Level: `1` destroys KV cache only; `2` destroys model weights and KV cache.
- Load Format: Set to `dummy` since ROLL updates model weights at startup.
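The sleep-level distinction above can be sketched as a toy model (illustrative names only, not the real vLLM or ROLL API):

```python
# Toy model of the sleep levels described above (not the real vLLM API):
# level 1 frees only the KV cache; level 2 also frees model weights, which is
# safe in ROLL because fresh weights are pushed to the worker before rollout.
def freed_on_sleep(sleep_level: int) -> set:
    if sleep_level == 1:
        return {"kv_cache"}
    if sleep_level == 2:
        return {"kv_cache", "model_weights"}
    raise ValueError(f"unsupported sleep_level: {sleep_level}")

print(sorted(freed_on_sleep(2)))  # ['kv_cache', 'model_weights']
```

Level 2 pairs naturally with the `dummy` load format: since weights are re-synced at startup anyway, there is no cost to discarding them during sleep.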