
Environment:Alibaba ROLL vLLM Inference Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, LLM_Inference
Last Updated 2026-02-07 19:00 GMT

Overview

A vLLM high-throughput inference backend environment supporting versions 0.8.4, 0.10.2, 0.11.0, 0.11.1, and 0.12.0, with version-specific patches for ROLL integration.

Description

This environment provides the vLLM high-throughput inference backend for ROLL. The framework applies version-specific monkey patches to customize Ray distributed execution, sleep mode, and worker management. Only explicitly tested vLLM versions are supported; untested versions produce a warning and may exhibit unexpected behavior. Key configuration includes disabling PyTorch expandable memory segments (`expandable_segments:False`), setting worker multiprocess method to `spawn`, and configuring per-worker cache directories for both vLLM and FlashInfer.

Usage

Use this environment when configuring actor_infer workers with the vLLM backend for high-throughput LLM text generation during rollout in RLVR, Agentic, and other pipelines that require response generation.
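
A hypothetical config fragment for an actor_infer worker might look like the sketch below. Only `gpu_memory_utilization`, `load_format: dummy`, and `sleep_level` are grounded in this page; the `actor_infer`, `infer_backend`, and `generating_args` key names are illustrative and may differ from ROLL's actual schema.

```yaml
actor_infer:
  infer_backend: vllm             # illustrative key name
  generating_args:
    gpu_memory_utilization: 0.8   # fraction of VRAM vLLM may claim (default 0.8)
    load_format: dummy            # ROLL pushes real weights at startup
    sleep_level: 2                # free model weights and KV cache between rollouts
```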

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA or AMD GPU | See CUDA/ROCm environment pages |
| VRAM | Controlled by `gpu_memory_utilization` | Default 0.8 (80% of GPU memory) |

Dependencies

Python Packages

  • `vllm` == 0.8.4 (torch 2.6.0), 0.10.2 (torch 2.8.0), 0.11.0, 0.11.1, or 0.12.0
  • `flashinfer` (installed as vLLM dependency)
  • `packaging` (for version comparison)

Environment Variables

  • `VLLM_USE_V1`: Controls vLLM engine version (default `1`; set to `0` on ROCm)
  • `VLLM_PORT`: vLLM server port (set internally)
  • `VLLM_CACHE_ROOT`: Cache directory (set internally per worker)
  • `VLLM_WORKER_MULTIPROC_METHOD`: Must be `spawn` (set internally)
  • `FLASHINFER_WORKSPACE_BASE`: FlashInfer workspace directory (set internally per worker)
  • `VLLM_ALLREDUCE_USE_SYMM_MEM`: Set to `0` to work around a bug in vLLM 0.11.0
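
A minimal sketch of how these variables could be set per worker. The function name and cache paths are hypothetical; ROLL sets these values internally.

```python
import os

def configure_vllm_env(worker_id: int, cache_root: str = "/tmp/vllm_cache") -> dict:
    """Set the vLLM-related environment variables listed above for one worker."""
    worker_dir = os.path.join(cache_root, f"worker_{worker_id}")
    env = {
        # vLLM worker processes must be spawned, not forked.
        "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
        # Per-worker cache directories avoid collisions between workers.
        "VLLM_CACHE_ROOT": worker_dir,
        "FLASHINFER_WORKSPACE_BASE": os.path.join(worker_dir, "flashinfer"),
        # Work around the vLLM 0.11.0 issue mentioned above.
        "VLLM_ALLREDUCE_USE_SYMM_MEM": "0",
    }
    os.environ.update(env)
    return env
```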

Quick Install

# For torch 2.6.0 setup
pip install vllm==0.8.4

# For torch 2.8.0 setup
pip install vllm==0.10.2

# Or use the pinned requirements file matching your torch version
pip install -r requirements_torch260_vllm.txt  # torch 2.6.0
pip install -r requirements_torch280_vllm.txt  # torch 2.8.0

Code Evidence

Version-specific patching from `roll/third_party/vllm/__init__.py:20-36`:

if Version("0.8.4") == Version(vllm.__version__):
    import roll.third_party.vllm.vllm_0_8_4  # apply patch
elif Version("0.10.2") == Version(vllm.__version__):
    ray_executor_class_v0 = safe_import_class(...)
elif Version("0.11.0") == Version(vllm.__version__):
    ray_executor_class_v0 = safe_import_class(...)
elif Version("0.12.0") == Version(vllm.__version__):
    ray_executor_class_v0 = None  # V0 deprecated
else:
    logger.warning(f"ROLL is not tested on vllm version {vllm.__version__}")
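
The dispatch above boils down to an exact-version match against a fixed list. A self-contained sketch of that check (the version list is taken from this page; `check_vllm_version` is not a real ROLL function):

```python
from packaging.version import Version

# Versions this page lists as tested with ROLL's patches.
TESTED_VERSIONS = ("0.8.4", "0.10.2", "0.11.0", "0.11.1", "0.12.0")

def check_vllm_version(installed: str) -> bool:
    """Return True if ROLL ships version-specific patches for `installed`."""
    if any(Version(installed) == Version(v) for v in TESTED_VERSIONS):
        return True
    print(f"ROLL is not tested on vllm version {installed}")
    return False
```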

Memory allocator override from `roll/third_party/vllm/__init__.py:53-56`:

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ""
torch.cuda.memory._set_allocator_settings("expandable_segments:False")

Spawn method requirement from `roll/third_party/vllm/__init__.py:65`:

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ROLL is not tested on vllm version X` | Unsupported vLLM version | Install a tested version: 0.8.4, 0.10.2, 0.11.0, 0.11.1, or 0.12.0 |
| `ROLL does not support using ray distributed executor` | Version mismatch between vLLM and ROLL patches | Match the installed vLLM version to one ROLL has patches for |
| `CUDA out of memory` during inference | `gpu_memory_utilization` set too high | Reduce `gpu_memory_utilization` (default 0.8) or use `sleep_level: 2` |

Compatibility Notes

  • vLLM 0.12.0: V0 engine deprecated; only V1 executor available.
  • ROCm: vLLM V1 disabled (`VLLM_USE_V1=0`).
  • Ascend NPU: Uses `vllm-ascend` package with NPUWorker class.
  • Sleep Level: `1` destroys KV cache only; `2` destroys model weights and KV cache.
  • Load Format: Set to `dummy` since ROLL updates model weights at startup.
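
The first two notes imply a simple rule for choosing the engine version. A sketch of that decision (the function and the `platform` labels are illustrative, not a ROLL API):

```python
def select_vllm_engine(platform: str, vllm_version: str) -> str:
    """Return the VLLM_USE_V1 value implied by the compatibility notes above."""
    if platform == "rocm":
        return "0"  # vLLM V1 is disabled on ROCm
    # On vLLM 0.12.0 the V0 engine is deprecated, so V1 is the only option;
    # on other tested versions V1 is the default anyway.
    return "1"
```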
