Implementation:Vllm project Vllm Setup
| Knowledge Sources | |
|---|---|
| Domains | Build_System, Configuration, Packaging |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Setuptools build script that orchestrates the compilation of vLLM's C++/CUDA/ROCm extensions via CMake, detects target hardware platforms, compiles gRPC protobuf definitions, and packages the project for distribution.
Description
setup.py is the central build orchestrator for the vLLM project. It auto-detects the target device (CUDA, ROCm, CPU, TPU, XPU) from the environment and PyTorch configuration, then uses CMake to compile platform-specific C++ and CUDA/HIP extensions. The file also handles gRPC proto compilation, precompiled wheel extraction, and generation of a version string that incorporates the CUDA or ROCm version, and it resolves platform-specific dependencies from requirements files.
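The device auto-detection described above can be pictured with a minimal sketch. The helper name `resolve_target_device`, its parameters, and the fallback order (explicit variable first, then ROCm, then CUDA, then CPU) are assumptions made for illustration and are not taken from setup.py. The two version arguments stand in for values such as `torch.version.cuda` and `torch.version.hip`, so the sketch runs without PyTorch installed.

```python
import os

# Values listed for VLLM_TARGET_DEVICE in the Inputs table of this page.
SUPPORTED_DEVICES = ("cuda", "rocm", "cpu", "tpu", "xpu", "empty")


def resolve_target_device(env=None, torch_cuda_version=None, torch_hip_version=None):
    """Hypothetical helper: pick a build target from the environment.

    The fallback order below is an assumption, not the script's actual logic.
    """
    env = os.environ if env is None else env

    # An explicit VLLM_TARGET_DEVICE always wins over auto-detection.
    explicit = env.get("VLLM_TARGET_DEVICE", "").strip().lower()
    if explicit:
        if explicit not in SUPPORTED_DEVICES:
            raise ValueError(f"unsupported target device: {explicit!r}")
        return explicit

    # Otherwise fall back to what the installed PyTorch build reports.
    if torch_hip_version is not None:
        return "rocm"
    if torch_cuda_version is not None:
        return "cuda"
    return "cpu"
```

Passing the environment and the PyTorch versions as arguments keeps the decision logic testable in isolation; a real build script would more likely read them from `os.environ` and `torch.version` directly.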
The script defines several custom setuptools command classes (cmake_build_ext, precompiled_build_ext, BuildPyAndGenerateGrpc) and utility classes (CMakeExtension, precompiled_wheel_utils) to handle the complex multi-platform build pipeline. It supports compiler caching via sccache/ccache, ninja build parallelism, and NVCC thread configuration.
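As an illustration of the compiler-cache support mentioned above, the following sketch chooses a cache launcher by probing `PATH` with `shutil.which`. The function name `pick_compiler_launcher`, the injectable `which` parameter, and the preference for sccache over ccache are assumptions for this example; only the `VLLM_DISABLE_SCCACHE` variable comes from this page.

```python
import os
import shutil


def pick_compiler_launcher(env=None, which=shutil.which):
    """Hypothetical helper: return "sccache", "ccache", or None.

    Preferring sccache over ccache is an assumption of this sketch.
    """
    env = os.environ if env is None else env

    # VLLM_DISABLE_SCCACHE="1" turns sccache off even when it is installed.
    sccache_disabled = env.get("VLLM_DISABLE_SCCACHE", "0") == "1"

    if which("sccache") is not None and not sccache_disabled:
        return "sccache"
    if which("ccache") is not None:
        return "ccache"
    # No cache tool found: compile without a launcher.
    return None
```

A launcher chosen this way would typically be handed to CMake through standard variables such as `CMAKE_CXX_COMPILER_LAUNCHER`; whether setup.py does exactly that is not stated on this page.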
Usage
This file is invoked automatically by pip or setuptools when installing vLLM from source (e.g., pip install -e . or python setup.py build_ext --inplace). Developers interact with it indirectly through environment variables such as VLLM_TARGET_DEVICE, MAX_JOBS, NVCC_THREADS, and VLLM_USE_PRECOMPILED to control the build process.
Code Reference
Source Location
Signature
def load_module_from_path(module_name, path) -> module
def is_sccache_available() -> bool
def is_ccache_available() -> bool
def is_ninja_available() -> bool
def is_freethreaded() -> bool
def compile_grpc_protos() -> bool
def get_nvcc_cuda_version() -> Version
def get_vllm_version() -> str
def get_requirements() -> list[str]
class BuildPyAndGenerateGrpc(build_py): ...
class DevelopAndGenerateGrpc(develop): ...
class CMakeExtension(Extension): ...
class cmake_build_ext(build_ext): ...
class precompiled_build_ext(build_ext): ...
class precompiled_wheel_utils: ...
class WheelLinkParser: ...
Import
# This file is not imported directly; it is executed by setuptools/pip.
# Example: pip install -e .
# Example: python setup.py build_ext --inplace
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| VLLM_TARGET_DEVICE | env var | No | Target device platform, one of cuda, rocm, cpu, tpu, xpu, or empty; auto-detected if unset |
| MAX_JOBS | env var | No | Maximum number of parallel compilation jobs |
| NVCC_THREADS | env var | No | Number of threads for NVCC parallel compilation (CUDA 11.2+) |
| VLLM_USE_PRECOMPILED | env var | No | If set, use a precompiled wheel instead of building from source |
| VLLM_DISABLE_SCCACHE | env var | No | Set to "1" to disable sccache even if available |
| CUDA_HOME | env var | No | Path to the CUDA toolkit installation |
| ROCM_HOME | env var | No | Path to the ROCm installation |
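To show how MAX_JOBS and NVCC_THREADS might interact, here is a minimal sketch that derives a job count and an NVCC thread count from the environment. The helper name `plan_parallelism` and the rule of dividing the job budget by the NVCC thread count (so the total thread count stays roughly constant) are assumptions for illustration, not a description of the script's exact behavior.

```python
import os


def plan_parallelism(env=None, cpu_count=None):
    """Hypothetical helper: return (num_jobs, nvcc_threads).

    Dividing jobs by NVCC threads is an assumed policy for this sketch.
    """
    env = os.environ if env is None else env
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1

    # MAX_JOBS caps parallel compile jobs; default to one job per CPU.
    max_jobs = env.get("MAX_JOBS")
    num_jobs = int(max_jobs) if max_jobs else cpu_count

    # NVCC_THREADS sets threads per nvcc invocation; never go below 1.
    nvcc_threads = max(1, int(env.get("NVCC_THREADS", "1")))

    # Each job may spawn nvcc_threads threads, so shrink the job count
    # to keep jobs * threads close to the original budget.
    num_jobs = max(1, num_jobs // nvcc_threads)
    return num_jobs, nvcc_threads
```

Under this policy, `MAX_JOBS=8 NVCC_THREADS=2` yields four concurrent jobs with two NVCC threads each.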
Outputs
| Name | Type | Description |
|---|---|---|
| vllm._C | shared library | Core C++/CUDA extension module |
| vllm._moe_C | shared library | Mixture-of-Experts C extension module |
| vllm._rocm_C | shared library | ROCm-specific C extension module (ROCm only) |
| vllm.vllm_flash_attn._vllm_fa2_C | shared library | Flash Attention 2 extension (CUDA only) |
| vllm.vllm_flash_attn._vllm_fa3_C | shared library | Flash Attention 3 extension (CUDA 12.3+ only) |
| gRPC stubs | Python files | Generated *_pb2.py and *_pb2_grpc.py from .proto files |
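The gRPC stub naming in the last row follows the usual protobuf convention of appending `_pb2` and `_pb2_grpc` to the proto file's stem. The sketch below derives the expected output names for a given .proto path. The helper name `expected_grpc_outputs` is hypothetical, and placing the stubs next to the .proto file is an assumption, since the actual output directory depends on how the compiler is invoked.

```python
from pathlib import Path


def expected_grpc_outputs(proto_path):
    """Hypothetical helper: list the stub files generated for one .proto.

    Assumes the stubs are written alongside the .proto file.
    """
    proto = Path(proto_path)
    stem = proto.stem  # "example" for "example.proto"
    return [
        proto.with_name(f"{stem}_pb2.py"),       # message classes
        proto.with_name(f"{stem}_pb2_grpc.py"),  # service stubs and servicers
    ]
```

A check like this can be used after a build to confirm that proto compilation produced both files for every definition.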
Usage Examples
# Install vLLM from source with CUDA support (auto-detected)
# $ pip install -e .
# Install for a specific target device
# $ VLLM_TARGET_DEVICE=rocm pip install -e .
# Build with limited parallelism and precompiled binaries
# $ MAX_JOBS=4 VLLM_USE_PRECOMPILED=1 pip install -e .
# Build extensions in-place for development
# $ python setup.py build_ext --inplace