Implementation:Vllm project Vllm Setup

From Leeroopedia


Knowledge Sources
Domains: Build_System, Configuration, Packaging
Last Updated: 2026-02-08 00:00 GMT

Overview

Setuptools build script that orchestrates the compilation of vLLM's C++/CUDA/ROCm extensions via CMake, detects target hardware platforms, compiles gRPC protobuf definitions, and packages the project for distribution.

Description

setup.py is the central build orchestrator for the vLLM project. It auto-detects the target device (CUDA, ROCm, CPU, TPU, XPU) from the environment and the PyTorch configuration, then uses CMake to compile platform-specific C++ and CUDA/HIP extensions. The file also handles gRPC proto compilation, precompiled wheel extraction, and version string generation that incorporates the CUDA or ROCm version, and it resolves platform-specific dependencies from requirements files.
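The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the code in setup.py: the helper name resolve_target_device, its parameters, and the fallback order (ROCm, then CUDA, then CPU) are assumptions made for the example. Only the VLLM_TARGET_DEVICE variable and the device names come from this page.

```python
import os

# Device names listed for VLLM_TARGET_DEVICE on this page.
SUPPORTED_DEVICES = ("cuda", "rocm", "cpu", "tpu", "xpu", "empty")


def resolve_target_device(env=None, torch_cuda_version=None, torch_hip_version=None):
    """Hypothetical helper: choose a build target.

    An explicit VLLM_TARGET_DEVICE wins. Otherwise fall back to whatever the
    installed PyTorch build reports (passed in here as plain values so the
    sketch does not need torch to be importable).
    """
    env = os.environ if env is None else env
    requested = env.get("VLLM_TARGET_DEVICE", "").strip().lower()
    if requested:
        if requested not in SUPPORTED_DEVICES:
            raise ValueError(f"unsupported VLLM_TARGET_DEVICE: {requested!r}")
        return requested
    # Assumed fallback order for illustration only.
    if torch_hip_version is not None:
        return "rocm"
    if torch_cuda_version is not None:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print(resolve_target_device({"VLLM_TARGET_DEVICE": "xpu"}))
```

In a real build the two version arguments would typically be read from torch.version.cuda and torch.version.hip, which are None when PyTorch was built without that backend.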

The script defines several custom setuptools command classes (cmake_build_ext, precompiled_build_ext, BuildPyAndGenerateGrpc, DevelopAndGenerateGrpc) and helper classes (CMakeExtension, precompiled_wheel_utils, WheelLinkParser) to handle the multi-platform build pipeline. It supports compiler caching via sccache or ccache, parallel builds with Ninja, and NVCC thread configuration.
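Compiler-cache detection of the kind mentioned above usually comes down to looking for the tool on PATH. The sketch below shows one way to do it with shutil.which; the function pick_compiler_launcher and the preference for sccache over ccache are assumptions for illustration, and the handling of VLLM_DISABLE_SCCACHE follows the description in the Inputs table rather than the actual source.

```python
import os
import shutil


def pick_compiler_launcher(env=None, which=shutil.which):
    """Hypothetical helper: return "sccache", "ccache", or None.

    `which` is injectable so the logic can be exercised without the tools
    being installed.
    """
    env = os.environ if env is None else env
    # Per the Inputs table, "1" disables sccache even when it is available.
    sccache_disabled = env.get("VLLM_DISABLE_SCCACHE", "0") == "1"
    if not sccache_disabled and which("sccache") is not None:
        return "sccache"
    if which("ccache") is not None:
        return "ccache"
    return None


if __name__ == "__main__":
    print(pick_compiler_launcher())
```

A launcher chosen this way is commonly handed to CMake through the CMAKE_<LANG>_COMPILER_LAUNCHER variables, for example -DCMAKE_CXX_COMPILER_LAUNCHER=sccache.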

Usage

pip runs this file automatically when installing vLLM from source (e.g., pip install -e .). It can also be invoked directly (e.g., python setup.py build_ext --inplace). Developers mostly control the build indirectly, through environment variables such as VLLM_TARGET_DEVICE, MAX_JOBS, NVCC_THREADS, and VLLM_USE_PRECOMPILED.
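To make the interaction between MAX_JOBS and NVCC_THREADS concrete, here is a small sketch of how a build script might turn the two variables into a job count. The function name, the default of one NVCC thread, and the choice to divide the job count by the thread count (so the total number of compiler threads stays near MAX_JOBS) are assumptions for illustration, not a statement of what setup.py does.

```python
import os


def compute_num_jobs(env=None, cpu_count=None):
    """Hypothetical helper: return (num_jobs, nvcc_threads) for a build."""
    env = os.environ if env is None else env
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1

    max_jobs = env.get("MAX_JOBS")
    num_jobs = int(max_jobs) if max_jobs else cpu_count

    # Assumed default: one NVCC thread when NVCC_THREADS is unset.
    nvcc_threads = max(1, int(env.get("NVCC_THREADS", "1")))

    # Each job may spawn nvcc_threads threads, so scale the job count down.
    num_jobs = max(1, num_jobs // nvcc_threads)
    return num_jobs, nvcc_threads


if __name__ == "__main__":
    print(compute_num_jobs({"MAX_JOBS": "8", "NVCC_THREADS": "4"}, cpu_count=16))
```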

Code Reference

Source Location

Signature

def load_module_from_path(module_name, path) -> module
def is_sccache_available() -> bool
def is_ccache_available() -> bool
def is_ninja_available() -> bool
def is_freethreaded() -> bool
def compile_grpc_protos() -> bool
def get_nvcc_cuda_version() -> Version
def get_vllm_version() -> str
def get_requirements() -> list[str]

class BuildPyAndGenerateGrpc(build_py): ...
class DevelopAndGenerateGrpc(develop): ...
class CMakeExtension(Extension): ...
class cmake_build_ext(build_ext): ...
class precompiled_build_ext(build_ext): ...
class precompiled_wheel_utils: ...
class WheelLinkParser: ...
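As an illustration of one helper from the listing above, load_module_from_path(module_name, path) could plausibly be written with the standard importlib.util machinery. Only the signature is given on this page, so the body below is an assumption: a common pattern for importing a single file without importing its parent package.

```python
import importlib.util
import sys


def load_module_from_path(module_name, path):
    """Load the Python file at `path` as a module named `module_name`.

    Sketch only: mirrors the signature listed above, with an assumed body.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {module_name!r} from {path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found by name
    # while its own top-level code runs.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

A build script can use this pattern to read a single configuration module from the source tree before the package itself is installed.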

Import

# This file is not imported directly; it is executed by setuptools/pip.
# Example: pip install -e .
# Example: python setup.py build_ext --inplace

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| VLLM_TARGET_DEVICE | env var | No | Target device platform: cuda, rocm, cpu, tpu, xpu, empty (auto-detected if unset) |
| MAX_JOBS | env var | No | Maximum number of parallel compilation jobs |
| NVCC_THREADS | env var | No | Number of threads for NVCC parallel compilation (CUDA 11.2+) |
| VLLM_USE_PRECOMPILED | env var | No | If set, use a precompiled wheel instead of building from source |
| VLLM_DISABLE_SCCACHE | env var | No | Set to "1" to disable sccache even if available |
| CUDA_HOME | env var | No | Path to the CUDA toolkit installation |
| ROCM_HOME | env var | No | Path to the ROCm installation |

Outputs

| Name | Type | Description |
|---|---|---|
| vllm._C | shared library | Core C++/CUDA extension module |
| vllm._moe_C | shared library | Mixture-of-Experts C extension module |
| vllm._rocm_C | shared library | ROCm-specific C extension module (ROCm only) |
| vllm.vllm_flash_attn._vllm_fa2_C | shared library | Flash Attention 2 extension (CUDA only) |
| vllm.vllm_flash_attn._vllm_fa3_C | shared library | Flash Attention 3 extension (CUDA 12.3+ only) |
| gRPC stubs | Python files | Generated *_pb2.py and *_pb2_grpc.py from .proto files |
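The platform conditions attached to the outputs above can be summarized in code. The sketch below encodes only what this page states (ROCm-only, CUDA-only, and CUDA 12.3+ qualifiers); the function expected_extensions is hypothetical, it covers only the cuda and rocm targets, and it assumes vllm._C and vllm._moe_C are built for both.

```python
def expected_extensions(target_device, cuda_version=None):
    """Hypothetical helper: list the extension modules a build would produce.

    `cuda_version` is a (major, minor) tuple such as (12, 4), or None.
    """
    if target_device not in ("cuda", "rocm"):
        raise ValueError("this sketch only covers the cuda and rocm targets")

    # Assumption: both modules without a platform qualifier are always built.
    extensions = ["vllm._C", "vllm._moe_C"]

    if target_device == "rocm":
        extensions.append("vllm._rocm_C")
    else:
        extensions.append("vllm.vllm_flash_attn._vllm_fa2_C")
        # Flash Attention 3 is listed as CUDA 12.3+ only.
        if cuda_version is not None and tuple(cuda_version) >= (12, 3):
            extensions.append("vllm.vllm_flash_attn._vllm_fa3_C")
    return extensions


if __name__ == "__main__":
    for name in expected_extensions("cuda", (12, 4)):
        print(name)
```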

Usage Examples

# Install vLLM from source with CUDA support (auto-detected)
# $ pip install -e .

# Install for a specific target device
# $ VLLM_TARGET_DEVICE=rocm pip install -e .

# Build with limited parallelism and precompiled binaries
# $ MAX_JOBS=4 VLLM_USE_PRECOMPILED=1 pip install -e .

# Build extensions in-place for development
# $ python setup.py build_ext --inplace
