Implementation:Vllm project Vllm Setup

Knowledge Sources	vllm
Domains	Build_System, Configuration, Packaging
Last Updated	2026-02-08 00:00 GMT

Overview

Setuptools build script that orchestrates the compilation of vLLM's C++/CUDA/ROCm extensions via CMake, detects target hardware platforms, compiles gRPC protobuf definitions, and packages the project for distribution.
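The platform-detection step can be pictured as resolving a build target from an explicit override plus PyTorch's build metadata. This is a rough sketch only. Several parts are assumptions, not a copy of setup.py's logic: the function name resolve_target_device, the fallback to cpu, and the rule that a cuda request on a HIP build of PyTorch means ROCm. torch.version.hip and torch.version.cuda are passed in as plain strings so the sketch runs without PyTorch.

```python
from typing import Optional

# Hypothetical helper; setup.py's real detection logic may differ.
KNOWN_DEVICES = {"cuda", "rocm", "cpu", "tpu", "xpu", "empty"}


def resolve_target_device(
    env_value: Optional[str],
    torch_hip_version: Optional[str],
    torch_cuda_version: Optional[str],
) -> str:
    """Pick a build target from an override and PyTorch's build info.

    torch_hip_version / torch_cuda_version stand in for torch.version.hip
    and torch.version.cuda (None when PyTorch was built without them).
    """
    if env_value:
        device = env_value.strip().lower()
        if device not in KNOWN_DEVICES:
            raise ValueError(f"unsupported VLLM_TARGET_DEVICE: {env_value!r}")
        # Assumption: a CUDA request on a HIP build of PyTorch means ROCm.
        if device == "cuda" and torch_hip_version:
            return "rocm"
        return device
    # No override: infer from how PyTorch was built (assumed fallback order).
    if torch_hip_version:
        return "rocm"
    if torch_cuda_version:
        return "cuda"
    return "cpu"
```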

Description

setup.py is the central build orchestrator for the vLLM project. It auto-detects the target device (CUDA, ROCm, CPU, TPU, XPU) from the environment and PyTorch configuration, then uses CMake to compile platform-specific C++ and CUDA/HIP extensions. The file also handles gRPC proto compilation, precompiled wheel extraction, version string generation incorporating CUDA/ROCm versions, and manages platform-specific dependency resolution from requirements files.
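A CUDA/ROCm-aware version string can be thought of as the base version plus a PEP 440 local version label. The sketch below assumes a hypothetical helper add_platform_suffix, tags such as +cu121 or +rocm624, and no tag when the toolkit matches a "main" version. get_vllm_version() remains the authoritative rule.

```python
def add_platform_suffix(
    base_version: str,
    platform: str,
    platform_version: str,
    main_version: str,
) -> str:
    """Append a local version tag such as '+cu121' or '+rocm624'.

    platform is 'cu' or 'rocm'. Skipping the tag when the toolkit equals
    main_version is an assumption of this sketch.
    """
    if platform_version == main_version:
        return base_version
    # Keep at most three digits: '12.1' -> '121', '6.2.4' -> '624'.
    digits = platform_version.replace(".", "")[:3]
    # PEP 440 allows a single '+'; extend an existing local label with '.'.
    sep = "." if "+" in base_version else "+"
    return f"{base_version}{sep}{platform}{digits}"
```

A local label may contain only one "+". That is why a version that already carries one, for example from a git hash, is extended with "." instead.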

The script defines several custom setuptools command classes (cmake_build_ext, precompiled_build_ext, BuildPyAndGenerateGrpc) and utility classes (CMakeExtension, precompiled_wheel_utils) to handle the complex multi-platform build pipeline. It supports compiler caching via sccache/ccache, ninja build parallelism, and NVCC thread configuration.
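Compiler caching and Ninja usually reach CMake as command-line arguments. CMAKE_<LANG>_COMPILER_LAUNCHER and -G Ninja are standard CMake options. Other details are assumptions: the function cmake_launcher_args is hypothetical, sccache is assumed to be preferred over ccache, and VLLM_DISABLE_SCCACHE is assumed to map to the disable_sccache flag. The which parameter defaults to shutil.which and can be swapped out for testing.

```python
import shutil
from typing import Callable, List, Optional


def cmake_launcher_args(
    disable_sccache: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return CMake arguments that select a compiler cache and generator."""
    args: List[str] = []
    launcher = None
    # Assumed preference order: sccache first, then ccache.
    if not disable_sccache and which("sccache"):
        launcher = "sccache"
    elif which("ccache"):
        launcher = "ccache"
    if launcher:
        # CMAKE_<LANG>_COMPILER_LAUNCHER prefixes each compiler call with the tool.
        for lang in ("C", "CXX", "CUDA", "HIP"):
            args.append(f"-DCMAKE_{lang}_COMPILER_LAUNCHER={launcher}")
    if which("ninja"):
        args += ["-G", "Ninja"]
    return args
```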

Usage

This file is invoked automatically by pip or setuptools when installing vLLM from source (e.g., pip install -e . or python setup.py build_ext --inplace). Developers interact with it indirectly through environment variables such as VLLM_TARGET_DEVICE, MAX_JOBS, NVCC_THREADS, and VLLM_USE_PRECOMPILED to control the build process.

Code Reference

Source Location

Repository: vllm
File: setup.py
Lines: 1-1062

Signature

def load_module_from_path(module_name, path) -> types.ModuleType
def is_sccache_available() -> bool
def is_ccache_available() -> bool
def is_ninja_available() -> bool
def is_freethreaded() -> bool
def compile_grpc_protos() -> bool
def get_nvcc_cuda_version() -> Version
def get_vllm_version() -> str
def get_requirements() -> list[str]

class BuildPyAndGenerateGrpc(build_py): ...
class DevelopAndGenerateGrpc(develop): ...
class CMakeExtension(Extension): ...
class cmake_build_ext(build_ext): ...
class precompiled_build_ext(build_ext): ...
class precompiled_wheel_utils: ...
class WheelLinkParser: ...
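The signature list only states that load_module_from_path returns a module. A conventional importlib implementation of that signature is sketched below. The motivation is an assumption: executing a single file, such as a settings module, without importing the not-yet-built vllm package.

```python
import importlib.util
import sys
from types import ModuleType


def load_module_from_path(module_name: str, path: str) -> ModuleType:
    """Execute one .py file as a module without importing its parent package."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name!r} from {path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register first, as in the standard importlib recipe, so lookups in
    # sys.modules during execution find this module.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

Assuming such a file exists, a call like load_module_from_path("envs", "vllm/envs.py") would execute only that file, not the whole package.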

Import

# This file is not imported directly; it is executed by setuptools/pip.
# Example: pip install -e .
# Example: python setup.py build_ext --inplace

I/O Contract

Inputs

Name	Type	Required	Description
VLLM_TARGET_DEVICE	env var	No	Target device platform: one of cuda, rocm, cpu, tpu, xpu, or empty; auto-detected if unset
MAX_JOBS	env var	No	Maximum number of parallel compilation jobs
NVCC_THREADS	env var	No	Number of threads for NVCC parallel compilation (CUDA 11.2+)
VLLM_USE_PRECOMPILED	env var	No	If set, extract prebuilt extension binaries from a precompiled wheel instead of compiling them from source
VLLM_DISABLE_SCCACHE	env var	No	Set to "1" to disable sccache even if available
CUDA_HOME	env var	No	Path to the CUDA toolkit installation
ROCM_HOME	env var	No	Path to the ROCm installation
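MAX_JOBS and NVCC_THREADS interact. If each compile job may run several NVCC threads, the job count has to shrink to avoid oversubscribing the CPU. The hypothetical helper compute_num_jobs below sketches one such rule: divide jobs by threads, but never drop below one. Its name, its arguments, and the default of one NVCC thread are assumptions.

```python
import os
from typing import Optional, Tuple


def compute_num_jobs(
    max_jobs: Optional[str],
    nvcc_threads: Optional[str],
    nvcc_supports_threads: bool,
    cpu_count: Optional[int] = None,
) -> Tuple[int, Optional[int]]:
    """Return (parallel build jobs, NVCC threads per job or None)."""
    if max_jobs:
        jobs = int(max_jobs)
    else:
        jobs = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    threads = None
    if nvcc_supports_threads:  # e.g. NVCC 11.2 or newer
        threads = int(nvcc_threads) if nvcc_threads else 1  # assumed default
        # Each job may spawn `threads` NVCC threads; keep the total near `jobs`.
        jobs = max(1, jobs // threads)
    return jobs, threads
```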

Outputs

Name	Type	Description
vllm._C	shared library	Core C++/CUDA extension module
vllm._moe_C	shared library	Mixture-of-Experts C extension module
vllm._rocm_C	shared library	ROCm-specific C extension module (ROCm only)
vllm.vllm_flash_attn._vllm_fa2_C	shared library	Flash Attention 2 extension (CUDA only)
vllm.vllm_flash_attn._vllm_fa3_C	shared library	Flash Attention 3 extension (CUDA 12.3+ only)
gRPC stubs	Python files	Generated _pb2.py and _pb2_grpc.py from .proto files
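The gRPC stubs in the last row are the files that grpcio-tools' protoc generates from a .proto file. The sketch below builds such a protoc command and predicts the two output paths. It assumes grpcio-tools is installed and that the .proto file sits directly in the include directory, because protoc mirrors subdirectories in its output. The helper names are hypothetical, and compile_grpc_protos() may invoke protoc differently.

```python
import sys
from pathlib import Path
from typing import List


def protoc_command(proto_file: Path, include_dir: Path, out_dir: Path) -> List[str]:
    """Build a command that generates <stem>_pb2.py and <stem>_pb2_grpc.py."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        f"-I{include_dir}",
        f"--python_out={out_dir}",
        f"--grpc_python_out={out_dir}",
        str(proto_file),
    ]


def expected_stubs(proto_file: Path, out_dir: Path) -> List[Path]:
    """Predict stub paths, assuming proto_file lies directly in include_dir."""
    stem = proto_file.stem
    return [out_dir / f"{stem}_pb2.py", out_dir / f"{stem}_pb2_grpc.py"]
```

Running the command would be a matter of subprocess.run(protoc_command(...), check=True).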

Usage Examples

# Install vLLM from source with CUDA support (auto-detected)
# $ pip install -e .

# Install for a specific target device
# $ VLLM_TARGET_DEVICE=rocm pip install -e .

# Build with limited parallelism and precompiled binaries
# $ MAX_JOBS=4 VLLM_USE_PRECOMPILED=1 pip install -e .

# Build extensions in-place for development
# $ python setup.py build_ext --inplace

Related Pages

Environment:Vllm_project_Vllm_Python_Dependencies
