Environment: Alibaba ROLL vLLM Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
vLLM inference backend environment supporting versions 0.8.4, 0.10.2, 0.11.0, 0.11.1, and 0.12.0 with version-specific patches for ROLL integration.
Description
This environment provides the vLLM high-throughput inference backend for ROLL. The framework applies version-specific monkey patches to customize Ray distributed execution, sleep mode, and worker management. Only explicitly tested vLLM versions are supported; untested versions produce a warning and may exhibit unexpected behavior. Key configuration includes disabling PyTorch expandable memory segments (`expandable_segments:False`), setting worker multiprocess method to `spawn`, and configuring per-worker cache directories for both vLLM and FlashInfer.
Usage
Use this environment when configuring actor_infer workers with the vLLM backend for high-throughput LLM text generation during rollout in RLVR, Agentic, and other pipelines that require response generation.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA or AMD GPU | See CUDA/ROCm environment pages |
| VRAM | Controlled by `gpu_memory_utilization` | Default 0.8 (80% of GPU memory) |
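The effect of `gpu_memory_utilization` is simple proportional capping. A minimal sketch of the arithmetic, assuming a hypothetical 80 GB GPU (the figure is illustrative, not queried from hardware):

```python
# Illustrative arithmetic only: 80 GB is a hypothetical GPU size.
total_vram_gb = 80
gpu_memory_utilization = 0.8  # default from the table above

vllm_budget_gb = total_vram_gb * gpu_memory_utilization
print(f"vLLM may reserve up to {vllm_budget_gb:.0f} GB for weights + KV cache")
# The remaining ~20% is left for other processes and CUDA overhead.
```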
Dependencies
Python Packages
- `vllm` == 0.8.4 (torch 2.6.0) or 0.10.2 (torch 2.8.0) or 0.11.0 or 0.12.0
- `flashinfer` (installed as vLLM dependency)
- `packaging` (for version comparison)
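ROLL compares versions for exact equality rather than with range specifiers. A minimal sketch of that gating, using the `packaging` dependency listed above (`check_vllm_version` and `TESTED_VERSIONS` are illustrative names, not ROLL code):

```python
# Sketch (not ROLL code) of exact-version gating with packaging.
from packaging.version import Version

TESTED_VERSIONS = ("0.8.4", "0.10.2", "0.11.0", "0.12.0")

def check_vllm_version(installed: str) -> bool:
    """True only if `installed` exactly matches a ROLL-tested vLLM release."""
    return any(Version(installed) == Version(v) for v in TESTED_VERSIONS)

print(check_vllm_version("0.10.2"))  # True
print(check_vllm_version("0.9.1"))   # False: ROLL would log a warning instead
```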
Environment Variables
- `VLLM_USE_V1`: Controls vLLM engine version (default `1`; set to `0` on ROCm)
- `VLLM_PORT`: vLLM server port (set internally)
- `VLLM_CACHE_ROOT`: Cache directory (set internally per worker)
- `VLLM_WORKER_MULTIPROC_METHOD`: Must be `spawn` (set internally)
- `FLASHINFER_WORKSPACE_BASE`: FlashInfer workspace directory (set internally per worker)
- `VLLM_ALLREDUCE_USE_SYMM_MEM`: Set to `0` to work around a vLLM 0.11.0 bug
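The variables above are set before vLLM is imported in each worker process. A hypothetical sketch of that setup (`configure_vllm_env` and the `/tmp` paths are illustrative, not ROLL's actual paths):

```python
# Hypothetical sketch of ROLL's per-worker environment setup; the function
# name and /tmp paths are illustrative.
import os

def configure_vllm_env(worker_rank: int, rocm: bool = False) -> None:
    # spawn is required: fork would inherit CUDA state into vLLM workers
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    # per-worker cache dirs keep colocated workers from clobbering each other
    os.environ["VLLM_CACHE_ROOT"] = f"/tmp/vllm_cache/worker_{worker_rank}"
    os.environ["FLASHINFER_WORKSPACE_BASE"] = f"/tmp/flashinfer/worker_{worker_rank}"
    if rocm:
        os.environ["VLLM_USE_V1"] = "0"  # V1 engine is disabled on ROCm

configure_vllm_env(worker_rank=0)
```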
Quick Install
```shell
# For a torch 2.6.0 setup
pip install vllm==0.8.4

# For a torch 2.8.0 setup
pip install vllm==0.10.2

# Or use the pinned requirements file matching your torch version
pip install -r requirements_torch260_vllm.txt   # torch 2.6.0
pip install -r requirements_torch280_vllm.txt   # torch 2.8.0
```
Code Evidence
Version-specific patching from `roll/third_party/vllm/__init__.py:20-36`:
```python
if Version("0.8.4") == Version(vllm.__version__):
    import roll.third_party.vllm.vllm_0_8_4  # apply patch
elif Version("0.10.2") == Version(vllm.__version__):
    ray_executor_class_v0 = safe_import_class(...)
elif Version("0.11.0") == Version(vllm.__version__):
    ray_executor_class_v0 = safe_import_class(...)
elif Version("0.12.0") == Version(vllm.__version__):
    ray_executor_class_v0 = None  # V0 deprecated
else:
    logger.warning(f"ROLL is not tested on vllm version {vllm.__version__}")
```
Memory allocator override from `roll/third_party/vllm/__init__.py:53-56`:
```python
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ""
torch.cuda.memory._set_allocator_settings("expandable_segments:False")
```
Spawn method requirement from `roll/third_party/vllm/__init__.py:65`:
```python
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ROLL is not tested on vllm version X` | Unsupported vLLM version | Install one of: 0.8.4, 0.10.2, 0.11.0, 0.12.0 |
| `ROLL does not support using ray distributed executor` | Version mismatch between vLLM and ROLL patches | Match vLLM version to ROLL's tested versions |
| `CUDA out of memory` during inference | GPU memory utilization too high | Reduce `gpu_memory_utilization` (default 0.8) or use `sleep_level: 2` |
Compatibility Notes
- vLLM 0.12.0: V0 engine deprecated; only V1 executor available.
- ROCm: vLLM V1 disabled (`VLLM_USE_V1=0`).
- Ascend NPU: Uses `vllm-ascend` package with NPUWorker class.
- Sleep Level: `1` destroys KV cache only; `2` destroys model weights and KV cache.
- Load Format: Set to `dummy` since ROLL updates model weights at startup.
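The sleep-level distinction above can be sketched as a toy model (illustrative names only, not the real vLLM or ROLL API):

```python
# Toy model of the sleep levels described above (not the real vLLM API):
# level 1 frees only the KV cache; level 2 also frees model weights, which is
# safe in ROLL because fresh weights are pushed to the worker before rollout.
def freed_on_sleep(sleep_level: int) -> set:
    if sleep_level == 1:
        return {"kv_cache"}
    if sleep_level == 2:
        return {"kv_cache", "model_weights"}
    raise ValueError(f"unsupported sleep_level: {sleep_level}")

print(sorted(freed_on_sleep(2)))  # ['kv_cache', 'model_weights']
```

Level 2 pairs naturally with the `dummy` load format: since weights are re-synced at startup anyway, there is no cost to discarding them during sleep.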