Environment:Vllm_project_Vllm_Python_Dependencies
Metadata
| Field | Value |
|---|---|
| Page Type | Environment |
| Project | Vllm_project_Vllm |
| Environment Name | Python_Dependencies |
| Category | Runtime Dependencies |
| Platform | Linux (x86_64, aarch64, arm64, s390x, ppc64le) |
| Python Version | >= 3.9 |
| Primary Package Manager | pip |
| Last Updated | 2026-02-08 |
Overview
This page documents the complete set of Python package dependencies required to build, install, and run vLLM — a high-throughput and memory-efficient inference and serving engine for large language models. The dependencies are organized into three tiers: core runtime dependencies (required on all platforms), CUDA-specific dependencies (required for NVIDIA GPU acceleration), and build dependencies (required only when compiling vLLM from source).
Description
vLLM relies on a carefully pinned and version-constrained dependency tree to ensure reproducible builds and stable inference behavior. The core dependencies cover tokenization, model loading, API serving, structured output generation, multimodal processing, and telemetry. CUDA-specific packages provide GPU-accelerated attention kernels and tensor operations. Build dependencies supply the compilation toolchain for vLLM's custom C++/CUDA extensions.
Dependencies are declared across three requirements files in the vLLM source tree:
- requirements/common.txt — Core runtime packages needed on every platform.
- requirements/cuda.txt — Packages specific to NVIDIA CUDA deployments.
- requirements/build.txt — Packages required only for building from source.
Several packages carry platform markers (e.g., platform_machine == "x86_64") so that architecture-specific wheels are only installed where they are available.
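A quick way to see which platform_machine value pip will evaluate these markers against on your host (the architecture set below is taken from the marker list in requirements/common.txt):

```python
import platform

# Architectures for which xgrammar / llguidance wheels are published,
# per the platform markers in requirements/common.txt.
SUPPORTED_MACHINES = {"x86_64", "aarch64", "arm64", "s390x", "ppc64le"}

machine = platform.machine()
print(f"platform_machine = {machine!r}")
print("marker matches:", machine in SUPPORTED_MACHINES)
```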
Usage
Installing from PyPI
The simplest way to obtain all dependencies is to install vLLM itself:
```shell
pip install vllm
```
This pulls in the core and CUDA dependencies automatically via the package metadata.
Installing from Source
When building from a local checkout, install each requirements tier explicitly:
```shell
# Build dependencies (needed before compilation)
pip install -r requirements/build.txt

# Core runtime dependencies
pip install -r requirements/common.txt

# CUDA-specific dependencies
pip install -r requirements/cuda.txt

# Then build and install vLLM itself
pip install -e .
```
Verifying the Environment
After installation, confirm that key packages are at the expected versions:
```shell
python -c "import torch; print('torch', torch.__version__)"
python -c "import transformers; print('transformers', transformers.__version__)"
python -c "import vllm; print('vllm', vllm.__version__)"
```
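Version strings can also be compared programmatically. The helper below is a minimal sketch for simple release-style versions only (a robust check should use packaging.version, which also handles pre-release and dev suffixes):

```python
def version_tuple(version: str) -> tuple:
    """Parse a simple release version like '4.56.0' into a comparable tuple.
    Pre-release suffixes (rc, dev, ...) are deliberately not handled here."""
    return tuple(int(part) for part in version.split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed version satisfies a >= constraint."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_minimum("4.56.2", "4.56.0"))  # True
print(meets_minimum("4.55.0", "4.56.0"))  # False
```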
System Requirements
| Requirement | Details |
|---|---|
| Operating System | Linux (Ubuntu 20.04+, CentOS 7+, or equivalent) |
| Python | >= 3.9 |
| CUDA Toolkit | >= 12.1 (for GPU deployments) |
| GPU | NVIDIA with compute capability >= 7.0 (V100, A100, H100, etc.) |
| CPU Architectures | x86_64, aarch64 / arm64, s390x, ppc64le |
| RAM | >= 16 GB recommended (model-dependent) |
| Disk | >= 10 GB for packages; additional space for model weights |
| C++ Compiler | GCC >= 9 or compatible (for source builds) |
| CMake | >= 3.26.1 (for source builds) |
Dependencies
Core Runtime Dependencies
These packages are declared in requirements/common.txt and are required on all platforms.
Model Loading and Tokenization
| Package | Version Constraint | Purpose |
|---|---|---|
| transformers | >= 4.56.0, < 5 | Hugging Face model loading, configuration parsing, and tokenizer support |
| tokenizers | >= 0.21.1 | Fast incremental detokenization for streaming token output |
| sentencepiece | (any) | Tokenizer backend for LLaMA-family models |
| tiktoken | >= 0.6.0 | Tokenizer backend for DBRX and GPT-family models |
| gguf | >= 0.17.0 | Loading models stored in the GGUF quantized format |
| compressed-tensors | == 0.13.0 | Loading and operating on compressed/quantized tensor formats |
| protobuf | (any) | Required by LlamaTokenizer and gRPC protocol definitions |
API Serving
| Package | Version Constraint | Purpose |
|---|---|---|
| openai | >= 1.99.1 | OpenAI-compatible API types; Responses API with reasoning content support |
| fastapi[standard] | >= 0.115.0 | HTTP API server framework with standard extras (uvicorn, etc.) |
| aiohttp | >= 3.13.3 | Asynchronous HTTP client for downstream calls and health checks |
| pydantic | >= 2.12.0 | Request/response validation and schema generation for API endpoints |
| requests | >= 2.26.0 | Synchronous HTTP client used in utilities and model downloading |
| grpcio | (any) | gRPC server runtime for the gRPC serving backend |
| grpcio-reflection | (any) | gRPC server reflection support for service discovery |
| anthropic | >= 0.71.0 | Anthropic API client integration |
| mcp | (any) | Model Context Protocol support |
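To illustrate the kind of request validation pydantic performs on the API endpoints, here is a plain-Python sketch of checks on an OpenAI-style chat-completions body. The field names follow the OpenAI request shape; vLLM's real validation is done by pydantic models, not hand-written checks like these:

```python
# Illustrative only: vLLM's real validation uses pydantic models.
def validate_chat_request(body: dict) -> list[str]:
    """Return a list of validation errors (empty when the body is valid)."""
    errors = []
    if not isinstance(body.get("model"), str):
        errors.append("'model' must be a string")
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("'messages' must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                errors.append(f"messages[{i}] must have 'role' and 'content'")
    temperature = body.get("temperature", 1.0)
    if not isinstance(temperature, (int, float)) or not 0.0 <= temperature <= 2.0:
        errors.append("'temperature' must be a number in [0, 2]")
    return errors

ok = {"model": "m", "messages": [{"role": "user", "content": "hi"}]}
bad = {"messages": []}
print(validate_chat_request(ok))   # []
print(validate_chat_request(bad))  # two errors
```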
Structured Output and Guided Decoding
| Package | Version Constraint | Platform | Purpose |
|---|---|---|---|
| lm-format-enforcer | == 0.11.3 | All | Grammar-based constrained decoding engine |
| outlines_core | == 0.2.11 | All | Core engine for structured/guided text generation |
| xgrammar | == 0.1.29 | x86_64, aarch64, arm64, s390x, ppc64le | Fast grammar-guided decoding with precompiled grammars |
| llguidance | >= 1.3.0, < 1.4.0 | x86_64, arm64, aarch64, s390x, ppc64le | Low-level guidance engine for constrained generation |
| regex | (any) | All | Higher-performance regular expression matching for grammar engines |
Multimodal Processing
| Package | Version Constraint | Purpose |
|---|---|---|
| pillow | (any) | Image loading, decoding, and preprocessing |
| opencv-python-headless | >= 4.13.0 | Video frame extraction and image I/O without GUI dependencies |
| mistral_common[image] | >= 1.9.0 | Mistral model multimodal (image) processing utilities |
| einops | (any) | Tensor rearrangement operations required by Qwen2-VL and similar models |
| blake3 | (any) | Fast hashing of multimodal inputs for caching and deduplication |
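The caching role of blake3 can be sketched as content-addressed deduplication: identical multimodal inputs hash to the same key, so preprocessing runs once. The sketch below uses hashlib.sha256 as a standard-library stand-in for blake3 (which vLLM prefers for speed):

```python
import hashlib

# hashlib.sha256 stands in for blake3 so this sketch needs only the
# standard library; the caching pattern is the same.
def content_key(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

cache = {}

def get_or_process(image_bytes: bytes, process):
    """Process each distinct input once; later identical inputs hit the cache."""
    key = content_key(image_bytes)
    if key not in cache:
        cache[key] = process(image_bytes)
    return cache[key]

calls = []
result1 = get_or_process(b"\x89PNG...", lambda b: calls.append(1) or len(b))
result2 = get_or_process(b"\x89PNG...", lambda b: calls.append(1) or len(b))
print(result1 == result2, len(calls))  # identical results, processed only once
```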
Observability and Telemetry
| Package | Version Constraint | Purpose |
|---|---|---|
| prometheus_client | >= 0.18.0 | Metrics exposition in Prometheus format for monitoring |
| setproctitle | (any) | Sets human-readable process titles for worker processes |
Serialization and Communication
| Package | Version Constraint | Purpose |
|---|---|---|
| pyzmq | >= 25.0.0 | ZeroMQ bindings for inter-process communication between engine components |
| msgspec | (any) | Fast MessagePack/JSON serialization for internal RPC messages |
| cloudpickle | (any) | Extended pickling for serializing closures and dynamic objects |
| cbor2 | (any) | CBOR encoding for cross-language serialization of structured data |
| pybase64 | (any) | Fast base64 encoding/decoding (e.g., for image data in API payloads) |
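The reason cloudpickle is needed alongside the stdlib pickle can be shown directly: standard pickle serializes functions by reference (module plus qualified name), so closures and lambdas created at runtime cannot be round-tripped, while plain data can. cloudpickle serializes the function body itself:

```python
import pickle

def try_pickle(obj) -> bool:
    """True when the stdlib pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

# Plain data pickles fine; a lambda defined at runtime does not,
# which is the gap cloudpickle fills for vLLM's dynamic objects.
print(try_pickle({"request_id": 1, "prompt": "hi"}))  # True
print(try_pickle(lambda x: x + 1))                    # False
```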
Utilities
| Package | Version Constraint | Purpose |
|---|---|---|
| numpy | (any) | Fundamental array operations used throughout the codebase |
| ninja | (any) | Build system used for JIT-compiling custom CUDA kernels at runtime |
| jinja2 | >= 3.1.6 | Template engine used for chat template rendering |
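Conceptually, chat template rendering turns a message list into the single prompt string a model expects. The real templates are jinja2 files shipped with each tokenizer; the pure-Python sketch below uses ChatML-style tags, which are just one common convention:

```python
# Conceptual sketch of chat template rendering. Real chat templates are
# jinja2 templates bundled with the tokenizer; the ChatML tags here are
# illustrative, not what every model uses.
def render_chatml(messages: list[dict]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation prompt for the model
    return "".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```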
CUDA-specific Dependencies
These packages are declared in requirements/cuda.txt and are required only for NVIDIA GPU deployments.
| Package | Version Constraint | Purpose |
|---|---|---|
| torch | == 2.9.1 | PyTorch with CUDA support; core tensor and autograd framework |
| torchaudio | == 2.9.1 | Audio processing utilities (used by speech-to-text model pipelines) |
| torchvision | == 0.24.1 | Vision transforms and utilities; required by phi3v processor |
| flashinfer-python | == 0.6.3 | FlashInfer attention kernels for high-performance PagedAttention; version must match Dockerfile |
| numba | == 0.61.2 | JIT compilation for N-gram speculative decoding routines |
| ray[cgraph] | >= 2.48.0 | Distributed execution framework with compiled graph support for tensor parallelism |
Build Dependencies
These packages are declared in requirements/build.txt and are required only when compiling vLLM from source.
| Package | Version Constraint | Purpose |
|---|---|---|
| cmake | >= 3.26.1 | Cross-platform build system for C++/CUDA extensions |
| ninja | (any) | Fast parallel build executor |
| packaging | >= 24.2 | Version parsing and comparison utilities used by setup.py |
| setuptools | >= 77.0.3, < 81.0.0 | Python package build backend |
| setuptools-scm | >= 8 | Automatic version inference from git tags |
| wheel | (any) | Wheel archive builder |
| jinja2 | >= 3.1.6 | Template rendering for generated source files |
| grpcio-tools | == 1.78.0 | Protocol buffer compiler plugin for generating gRPC Python stubs |
Credentials
No credentials are required to install the Python dependencies listed on this page. All packages are available from the public PyPI repository. If your deployment environment uses a private PyPI mirror or artifact registry, configure pip accordingly via pip.conf or the PIP_INDEX_URL environment variable.
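For example, either of the following points pip at a private mirror (the URL below is a placeholder, not a real registry):

```shell
# One-off: override the index for the current shell session.
export PIP_INDEX_URL=https://mirror.example.com/simple

# Persistent: write the setting into pip's config file.
pip config set global.index-url https://mirror.example.com/simple
```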
Quick Install
```shell
# Create and activate a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate

# Option A: Install vLLM from PyPI (includes all required dependencies)
pip install vllm

# Option B: Install from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/common.txt
pip install -r requirements/cuda.txt
pip install -e .
```
Code Evidence
The following excerpts from the vLLM source tree illustrate key version constraints and their rationale.
Fast Incremental Detokenization (common.txt:11)
```text
tokenizers >= 0.21.1  # Required for fast incremental detokenization.
```
The tokenizers package at version 0.21.1 or later provides an incremental detokenization API that vLLM uses to decode tokens into text progressively during streaming, avoiding the cost of re-decoding the entire sequence on each new token.
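The streaming behavior can be illustrated without the library: keep track of the text already emitted and yield only the newly decoded suffix at each step. This toy uses a plain string vocabulary; a real detokenizer works on token IDs, handles tokens that only become valid text once a successor arrives (e.g. multi-byte UTF-8), and bounds the re-decoded window to the last few tokens rather than the whole sequence:

```python
# Toy illustration of incremental detokenization for streaming output.
def stream_decode(token_ids, vocab):
    emitted = ""
    for i in range(1, len(token_ids) + 1):
        full = "".join(vocab[t] for t in token_ids[:i])
        new_text = full[len(emitted):]   # only the newly decoded suffix
        emitted = full
        yield new_text

vocab = {0: "Hel", 1: "lo", 2: ", world"}
chunks = list(stream_decode([0, 1, 2], vocab))
print(chunks)            # ['Hel', 'lo', ', world']
print("".join(chunks))   # Hello, world
```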
Platform-constrained xgrammar (common.txt:27)
```text
xgrammar == 0.1.29; platform_machine == "x86_64" or platform_machine == "aarch64" or platform_machine == "arm64" or platform_machine == "s390x" or platform_machine == "ppc64le"
```
The xgrammar package is only installed on architectures where precompiled wheels are available. This platform marker prevents installation failures on unsupported architectures.
FlashInfer Version Pinning (cuda.txt:13)
```text
# FlashInfer should be updated together with the Dockerfile
flashinfer-python==0.6.3
```
FlashInfer is pinned to an exact version because its CUDA kernels must match the compiled vLLM binary. Upgrading FlashInfer requires a coordinated update of the Dockerfile and the vLLM build to ensure ABI compatibility.
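A pin like this can be verified at runtime with the standard library's importlib.metadata; the sketch below returns None when the distribution is absent (e.g. a CPU-only install without flashinfer-python) rather than raising:

```python
from importlib import metadata

def check_pin(dist_name: str, pinned: str):
    """True/False when installed, None when the distribution is absent."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return installed == pinned

print(check_pin("flashinfer-python", "0.6.3"))
print(check_pin("no-such-distribution-xyz", "1.0"))  # None
```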
Common Errors
Version Conflict with transformers
```text
ERROR: Cannot install transformers==4.55.0 because vllm requires transformers>=4.56.0,<5
```
Resolution: Upgrade transformers to a compatible version:
```shell
pip install "transformers>=4.56.0,<5"
```
Missing xgrammar on Unsupported Architecture
```text
ModuleNotFoundError: No module named 'xgrammar'
```
Resolution: This is expected on architectures not in the platform marker list (e.g., Apple Silicon via Rosetta may report a different platform_machine). vLLM falls back to alternative guided decoding backends when xgrammar is unavailable.
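The fallback pattern can be sketched with a try-import loop; the candidate module names below are illustrative, not vLLM's actual backend registry:

```python
import importlib

# Sketch of backend selection with fallback, in the spirit of choosing
# a guided-decoding backend when xgrammar is unavailable. Names are
# illustrative; 'json' always imports, so it plays the guaranteed
# fallback here.
def pick_backend(candidates):
    for name in candidates:
        try:
            importlib.import_module(name)
            return name
        except ImportError:
            continue
    raise RuntimeError("no structured-output backend available")

print(pick_backend(["xgrammar_definitely_missing", "json"]))  # json
```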
FlashInfer Version Mismatch
```text
RuntimeError: FlashInfer version mismatch. Expected 0.6.3, got 0.6.2
```
Resolution: Install the exact pinned version. FlashInfer must match the version that vLLM was compiled against:
```shell
pip install flashinfer-python==0.6.3
```
CUDA Toolkit Not Found During Build
```text
CMake Error: Could not find CUDA toolkit. Set CUDA_HOME or ensure nvcc is in PATH.
```
Resolution: Set the CUDA_HOME environment variable or ensure that nvcc is on your PATH:
```shell
export CUDA_HOME=/usr/local/cuda
```
setuptools Version Too High
```text
ERROR: setuptools 81.0.0 is incompatible with vllm build requirements (<81.0.0)
```
Resolution: Pin setuptools within the allowed range:
```shell
pip install "setuptools>=77.0.3,<81.0.0"
```
Numba / NumPy Incompatibility
```text
ImportError: Numba needs NumPy 1.x or 2.x, but got NumPy 3.0.0
```
Resolution: Reinstall the pinned numba version and let pip resolve a compatible numpy:
```shell
pip install numba==0.61.2
```
Compatibility Notes
- Python version: vLLM targets Python >= 3.9. Some dependencies (notably xgrammar and flashinfer-python) may not yet publish wheels for the very latest Python release; check PyPI for wheel availability.
- torch version: The CUDA requirements pin torch == 2.9.1. Using a different PyTorch version may cause ABI mismatches with vLLM's compiled CUDA extensions and FlashInfer kernels.
- Platform markers: xgrammar and llguidance are restricted to specific CPU architectures. On other platforms, vLLM automatically falls back to lm-format-enforcer or outlines_core for structured output generation.
- FlashInfer and Dockerfile: The FlashInfer version is tightly coupled with the Docker build. When upgrading FlashInfer, the Dockerfile and any CI/CD pipeline configurations must be updated simultaneously.
- grpcio-tools version: The build-time grpcio-tools == 1.78.0 generates gRPC stubs that must be compatible with the runtime grpcio package. Mismatched versions can cause serialization errors.
- compressed-tensors pinning: The exact pin (== 0.13.0) reflects a tight coupling with vLLM's quantization and weight-loading code paths. Upgrading requires testing against all supported quantization formats.
Related Pages
- Implementation:Vllm_project_Vllm_Pip_Install_Vllm — Installation procedure that consumes these dependencies.
- Implementation:Vllm_project_Vllm_Chat_Template_Application — Chat template rendering that relies on jinja2 and tokenizers from this environment.
- Implementation:Vllm_project_Vllm_Pydantic_Schema_Generation — Pydantic-based schema generation for the OpenAI-compatible API.
- Implementation:Vllm_project_Vllm_OpenAI_Chat_Completions — OpenAI chat completions endpoint that depends on openai, fastapi, and pydantic from this environment.