
Environment: Vllm_project_Vllm / Python_Dependencies

From Leeroopedia


Metadata

Field Value
Page Type Environment
Project Vllm_project_Vllm
Environment Name Python_Dependencies
Category Runtime Dependencies
Platform Linux (x86_64, aarch64, arm64, s390x, ppc64le)
Python Version >= 3.9
Primary Package Manager pip
Last Updated 2026-02-08

Overview

This page documents the complete set of Python package dependencies required to build, install, and run vLLM — a high-throughput and memory-efficient inference and serving engine for large language models. The dependencies are organized into three tiers: core runtime dependencies (required on all platforms), CUDA-specific dependencies (required for NVIDIA GPU acceleration), and build dependencies (required only when compiling vLLM from source).

Description

vLLM relies on a carefully pinned and version-constrained dependency tree to ensure reproducible builds and stable inference behavior. The core dependencies cover tokenization, model loading, API serving, structured output generation, multimodal processing, and telemetry. CUDA-specific packages provide GPU-accelerated attention kernels and tensor operations. Build dependencies supply the compilation toolchain for vLLM's custom C++/CUDA extensions.

Dependencies are declared across three requirements files in the vLLM source tree:

  • requirements/common.txt — Core runtime packages needed on every platform.
  • requirements/cuda.txt — Packages specific to NVIDIA CUDA deployments.
  • requirements/build.txt — Packages required only for building from source.

Several packages carry platform markers (e.g., platform_machine == "x86_64") so that architecture-specific wheels are only installed where they are available.
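The effect of such a marker can be checked locally. A minimal, stdlib-only sketch (the `XGRAMMAR_MACHINES` constant and `marker_matches` helper are illustrative, not part of vLLM or pip):

```python
import platform

# Architectures listed in the xgrammar platform marker from
# requirements/common.txt (illustrative constant, not a vLLM API).
XGRAMMAR_MACHINES = {"x86_64", "aarch64", "arm64", "s390x", "ppc64le"}

def marker_matches(machine=None):
    """Return True when the marker would allow installation on `machine`."""
    machine = machine or platform.machine()
    return machine in XGRAMMAR_MACHINES
```

Running `marker_matches()` on the current host mirrors what pip decides when it evaluates the PEP 508 marker during install.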

Usage

Installing from PyPI

The simplest way to obtain all dependencies is to install vLLM itself:

pip install vllm

This pulls in the core and CUDA dependencies automatically via the package metadata.

Installing from Source

When building from a local checkout, install each requirements tier explicitly:

# Build dependencies (needed before compilation)
pip install -r requirements/build.txt

# Core runtime dependencies
pip install -r requirements/common.txt

# CUDA-specific dependencies
pip install -r requirements/cuda.txt

# Then build and install vLLM itself
pip install -e .

Verifying the Environment

After installation, confirm that key packages are at the expected versions:

python -c "import torch; print('torch', torch.__version__)"
python -c "import transformers; print('transformers', transformers.__version__)"
python -c "import vllm; print('vllm', vllm.__version__)"
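A slightly more robust check gathers all versions in one pass without importing the (potentially heavy) packages themselves, using distribution metadata. The `report` helper below is a hypothetical convenience, not part of vLLM:

```python
from importlib.metadata import PackageNotFoundError, version

def report(package_names):
    """Map each distribution name to its installed version (or a note)."""
    result = {}
    for name in package_names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = "not installed"
    return result

# Example: report(["torch", "transformers", "vllm", "flashinfer-python"])
```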

System Requirements

Requirement Details
Operating System Linux (Ubuntu 20.04+, CentOS 7+, or equivalent)
Python >= 3.9
CUDA Toolkit >= 12.1 (for GPU deployments)
GPU NVIDIA with compute capability >= 7.0 (V100, A100, H100, etc.)
CPU Architectures x86_64, aarch64 / arm64, s390x, ppc64le
RAM >= 16 GB recommended (model-dependent)
Disk >= 10 GB for packages; additional space for model weights
C++ Compiler GCC >= 9 or compatible (for source builds)
CMake >= 3.26.1 (for source builds)

Dependencies

Core Runtime Dependencies

These packages are declared in requirements/common.txt and are required on all platforms.

Model Loading and Tokenization

Package Version Constraint Purpose
transformers >= 4.56.0, < 5 Hugging Face model loading, configuration parsing, and tokenizer support
tokenizers >= 0.21.1 Fast incremental detokenization for streaming token output
sentencepiece (any) Tokenizer backend for LLaMA-family models
tiktoken >= 0.6.0 Tokenizer backend for DBRX and GPT-family models
gguf >= 0.17.0 Loading models stored in the GGUF quantized format
compressed-tensors == 0.13.0 Loading and operating on compressed/quantized tensor formats
protobuf (any) Required by LlamaTokenizer and gRPC protocol definitions

API Serving

Package Version Constraint Purpose
openai >= 1.99.1 OpenAI-compatible API types; Responses API with reasoning content support
fastapi[standard] >= 0.115.0 HTTP API server framework with standard extras (uvicorn, etc.)
aiohttp >= 3.13.3 Asynchronous HTTP client for downstream calls and health checks
pydantic >= 2.12.0 Request/response validation and schema generation for API endpoints
requests >= 2.26.0 Synchronous HTTP client used in utilities and model downloading
grpcio (any) gRPC server runtime for the gRPC serving backend
grpcio-reflection (any) gRPC server reflection support for service discovery
anthropic >= 0.71.0 Anthropic API client integration
mcp (any) Model Context Protocol support

Structured Output and Guided Decoding

Package Version Constraint Platform Purpose
lm-format-enforcer == 0.11.3 All Grammar-based constrained decoding engine
outlines_core == 0.2.11 All Core engine for structured/guided text generation
xgrammar == 0.1.29 x86_64, aarch64, arm64, s390x, ppc64le Fast grammar-guided decoding with precompiled grammars
llguidance >= 1.3.0, < 1.4.0 x86_64, arm64, aarch64, s390x, ppc64le Low-level guidance engine for constrained generation
regex (any) All Higher-performance regular expression matching for grammar engines
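Conceptually, all of these engines do the same thing: at each decode step they restrict sampling to tokens that keep the partial output valid under a grammar. A toy, stdlib-only illustration using a fixed set of target strings in place of a real grammar (`allowed_tokens` is hypothetical, not any engine's API):

```python
def allowed_tokens(prefix, vocab, targets):
    """Return vocab entries that keep prefix+token a prefix of some target.

    Real engines (xgrammar, llguidance, outlines_core) compile a grammar
    into an automaton and mask disallowed token ids at each decode step;
    this brute-force version only illustrates the idea.
    """
    return [tok for tok in vocab
            if any(t.startswith(prefix + tok) for t in targets)]
```

For example, constrained to the targets {"true", "false"}, only tokens beginning with "t" or "f" survive the first step.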

Multimodal Processing

Package Version Constraint Purpose
pillow (any) Image loading, decoding, and preprocessing
opencv-python-headless >= 4.13.0 Video frame extraction and image I/O without GUI dependencies
mistral_common[image] >= 1.9.0 Mistral model multimodal (image) processing utilities
einops (any) Tensor rearrangement operations required by Qwen2-VL and similar models
blake3 (any) Fast hashing of multimodal inputs for caching and deduplication

Observability and Telemetry

Package Version Constraint Purpose
prometheus_client >= 0.18.0 Metrics exposition in Prometheus format for monitoring
setproctitle (any) Sets human-readable process titles for worker processes

Serialization and Communication

Package Version Constraint Purpose
pyzmq >= 25.0.0 ZeroMQ bindings for inter-process communication between engine components
msgspec (any) Fast MessagePack/JSON serialization for internal RPC messages
cloudpickle (any) Extended pickling for serializing closures and dynamic objects
cbor2 (any) CBOR encoding for cross-language serialization of structured data
pybase64 (any) Fast base64 encoding/decoding (e.g., for image data in API payloads)
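For instance, pybase64 is a faster drop-in for the standard library's base64 module; the interface it accelerates looks like this (`encode_image_payload` is an illustrative helper, not vLLM code):

```python
import base64  # pybase64 exposes the same b64encode/b64decode interface

def encode_image_payload(image_bytes):
    """Base64-encode raw image bytes for embedding in a JSON API payload."""
    return base64.b64encode(image_bytes).decode("ascii")
```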

Utilities

Package Version Constraint Purpose
numpy (any) Fundamental array operations used throughout the codebase
ninja (any) Build system used for JIT-compiling custom CUDA kernels at runtime
jinja2 >= 3.1.6 Template engine used for chat template rendering

CUDA-specific Dependencies

These packages are declared in requirements/cuda.txt and are required only for NVIDIA GPU deployments.

Package Version Constraint Purpose
torch == 2.9.1 PyTorch with CUDA support; core tensor and autograd framework
torchaudio == 2.9.1 Audio processing utilities (used by speech-to-text model pipelines)
torchvision == 0.24.1 Vision transforms and utilities; required by phi3v processor
flashinfer-python == 0.6.3 FlashInfer attention kernels for high-performance PagedAttention; version must match Dockerfile
numba == 0.61.2 JIT compilation for N-gram speculative decoding routines
ray[cgraph] >= 2.48.0 Distributed execution framework with compiled graph support for tensor parallelism

Build Dependencies

These packages are declared in requirements/build.txt and are required only when compiling vLLM from source.

Package Version Constraint Purpose
cmake >= 3.26.1 Cross-platform build system for C++/CUDA extensions
ninja (any) Fast parallel build executor
packaging >= 24.2 Version parsing and comparison utilities used by setup.py
setuptools >= 77.0.3, < 81.0.0 Python package build backend
setuptools-scm >= 8 Automatic version inference from git tags
wheel (any) Wheel archive builder
jinja2 >= 3.1.6 Template rendering for generated source files
grpcio-tools == 1.78.0 Protocol buffer compiler plugin for generating gRPC Python stubs

Credentials

No credentials are required to install the Python dependencies listed on this page. All packages are available from the public PyPI repository. If your deployment environment uses a private PyPI mirror or artifact registry, configure pip accordingly via pip.conf or the PIP_INDEX_URL environment variable.
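For instance, a private index can be set either per shell session or persistently; the mirror URL below is a hypothetical placeholder to substitute with your registry:

```shell
# Point pip at a private mirror for this shell session
# (placeholder URL -- replace with your registry).
export PIP_INDEX_URL="https://pypi.example.com/simple"

# Or persist the setting in pip.conf (~/.config/pip/pip.conf on Linux):
#   [global]
#   index-url = https://pypi.example.com/simple
```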

Quick Install

# Create and activate a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Option A: Install vLLM from PyPI (includes all required dependencies)
pip install vllm

# Option B: Install from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/common.txt
pip install -r requirements/cuda.txt
pip install -e .

Code Evidence

The following excerpts from the vLLM source tree illustrate key version constraints and their rationale.

Fast Incremental Detokenization (common.txt:11)

tokenizers >= 0.21.1  # Required for fast incremental detokenization.

The tokenizers package at version 0.21.1 or later provides an incremental detokenization API that vLLM uses to decode tokens into text progressively during streaming, avoiding the cost of re-decoding the entire sequence on each new token.
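The streaming interface this enables can be sketched in plain Python. Note this is a conceptual illustration, not the tokenizers API, and a naive version like this still re-decodes the whole sequence internally, which is exactly the cost the library's incremental API avoids:

```python
class IncrementalDecoder:
    """Toy streaming decoder: emits only the text added by each new token."""

    def __init__(self, decode_fn):
        self.decode_fn = decode_fn  # maps a list of token ids to text
        self.ids = []
        self.decoded = ""

    def step(self, token_id):
        """Append one token id and return only the newly produced text."""
        self.ids.append(token_id)
        full = self.decode_fn(self.ids)
        new_text = full[len(self.decoded):]
        self.decoded = full
        return new_text
```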

Platform-constrained xgrammar (common.txt:27)

xgrammar == 0.1.29; platform_machine == "x86_64" or platform_machine == "aarch64" or platform_machine == "arm64" or platform_machine == "s390x" or platform_machine == "ppc64le"

The xgrammar package is only installed on architectures where precompiled wheels are available. This platform marker prevents installation failures on unsupported architectures.

FlashInfer Version Pinning (cuda.txt:13)

# FlashInfer should be updated together with the Dockerfile
flashinfer-python==0.6.3

FlashInfer is pinned to an exact version because its CUDA kernels must match the compiled vLLM binary. Upgrading FlashInfer requires a coordinated update of the Dockerfile and the vLLM build to ensure ABI compatibility.
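A startup guard can surface such a mismatch early. The following is a hypothetical check, not vLLM's actual code:

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_FLASHINFER = "0.6.3"  # keep in sync with requirements/cuda.txt

def check_flashinfer(expected=EXPECTED_FLASHINFER):
    """Raise RuntimeError unless the exact pinned FlashInfer is installed."""
    try:
        installed = version("flashinfer-python")
    except PackageNotFoundError:
        raise RuntimeError("flashinfer-python is not installed")
    if installed != expected:
        raise RuntimeError(
            f"FlashInfer version mismatch. Expected {expected}, got {installed}"
        )
```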

Common Errors

Version Conflict with transformers

ERROR: Cannot install transformers==4.55.0 because vllm requires transformers>=4.56.0,<5

Resolution: Upgrade transformers to a compatible version:

pip install "transformers>=4.56.0,<5"

Missing xgrammar on Unsupported Architecture

ModuleNotFoundError: No module named 'xgrammar'

Resolution: This is expected on architectures not in the platform marker list (e.g., riscv64, or any platform for which no prebuilt xgrammar wheel exists). vLLM falls back to alternative guided decoding backends when xgrammar is unavailable.
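The fallback can be sketched as a guarded import (`pick_backend` is illustrative; vLLM's actual backend selection is more involved):

```python
def pick_backend():
    """Prefer xgrammar when its wheel is importable, else fall back."""
    try:
        import xgrammar  # noqa: F401
        return "xgrammar"
    except ImportError:
        # e.g., lm-format-enforcer or outlines_core
        return "fallback"
```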

FlashInfer Version Mismatch

RuntimeError: FlashInfer version mismatch. Expected 0.6.3, got 0.6.2

Resolution: Install the exact pinned version. FlashInfer must match the version that vLLM was compiled against:

pip install flashinfer-python==0.6.3

CUDA Toolkit Not Found During Build

CMake Error: Could not find CUDA toolkit. Set CUDA_HOME or ensure nvcc is in PATH.

Resolution: Set the CUDA_HOME environment variable or ensure that nvcc is on your PATH:

export CUDA_HOME=/usr/local/cuda

setuptools Version Too High

ERROR: setuptools 81.0.0 is incompatible with vllm build requirements (<81.0.0)

Resolution: Pin setuptools within the allowed range:

pip install "setuptools>=77.0.3,<81.0.0"

Numba / NumPy Incompatibility

ImportError: Numba needs NumPy 1.x or 2.x, but got NumPy 3.0.0

Resolution: Let pip resolve compatible versions by reinstalling numba, which will constrain numpy:

pip install numba==0.61.2

Compatibility Notes

  • Python version: vLLM targets Python >= 3.9. Some dependencies (notably xgrammar and flashinfer-python) may not yet publish wheels for the very latest Python release; check PyPI for wheel availability.
  • torch version: The CUDA requirements pin torch == 2.9.1. Using a different PyTorch version may cause ABI mismatches with vLLM's compiled CUDA extensions and FlashInfer kernels.
  • Platform markers: xgrammar and llguidance are restricted to specific CPU architectures. On other platforms, vLLM automatically falls back to lm-format-enforcer or outlines_core for structured output generation.
  • FlashInfer and Dockerfile: The FlashInfer version is tightly coupled with the Docker build. When upgrading FlashInfer, the Dockerfile and any CI/CD pipeline configurations must be updated simultaneously.
  • grpcio-tools version: The build-time grpcio-tools == 1.78.0 generates gRPC stubs that must be compatible with the runtime grpcio package. Mismatched versions can cause serialization errors.
  • compressed-tensors pinning: The exact pin (== 0.13.0) reflects a tight coupling with vLLM's quantization and weight-loading code paths. Upgrading requires testing against all supported quantization formats.
