Environment: vLLM CUDA Programming Model
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, CUDA |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The NVIDIA CUDA programming-model environment for vLLM, providing the compiler toolchain, the PTX ISA, and the device-level programming abstractions required by vLLM's custom CUDA kernels, including the Marlin quantization kernels and their matrix multiply-accumulate (MMA) operations.
Description
This environment defines the CUDA programming model and compilation infrastructure that vLLM uses to build and execute custom GPU kernels. Unlike the higher-level CUDA runtime environment, it focuses on the low-level CUDA programming abstractions: the nvcc compiler, the PTX (Parallel Thread Execution) intermediate representation, warp-level matrix operations (WMMA/MMA), shared memory management, and register allocation. vLLM's Marlin kernels use inline PTX assembly for warp-level matrix multiply-accumulate operations to achieve near-peak throughput for weight-only quantized inference (INT4/INT8 weights with FP16 accumulation). The kernels are written directly against the GPU's warp execution model, shared memory bank layout, and register file budget, using PTX intrinsics for maximum hardware utilization.
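As a concrete illustration of the inline-PTX style described above, the sketch below wraps a single warp-level `mma.sync` tensor-core instruction in a device function. This is a hedged, minimal example in the general shape of Marlin-style code, not vLLM's actual implementation; the function name and fragment variable names are illustrative.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Sketch: one warp-level 16x8x16 FP16 multiply-accumulate (D = A*B + C)
// issued via inline PTX. Requires SM 8.0+ (m16n8k16 with f16 operands).
// Per-thread fragment sizes follow the PTX ISA register layout:
//   A: 4 x 32-bit regs (8 halves), B: 2 regs (4 halves), C/D: 2 regs (4 halves).
__device__ inline void mma_m16n8k16_fp16(const uint32_t* a,  // A fragment
                                         const uint32_t* b,  // B fragment
                                         uint32_t* c)        // C/D accumulator
{
#if __CUDA_ARCH__ >= 800
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%0,%1};\n"
      : "+r"(c[0]), "+r"(c[1])                               // D reuses C regs
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]));
#endif
}
```

In Marlin-style kernels such a primitive sits in an inner loop, fed by fragments that were dequantized from packed INT4/INT8 weights in registers, so the tensor cores stay busy while dequantization overlaps with the MMA pipeline.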
Usage
This environment is required at build time when compiling vLLM from source. The nvcc compiler translates CUDA C++ source files (and inline PTX assembly) into device-specific binary code (SASS) or portable PTX. The CUDA_HOME or CUDA_PATH environment variable should point to the CUDA toolkit installation directory. Target GPU architectures are specified via TORCH_CUDA_ARCH_LIST or vLLM's CUDA_SUPPORTED_ARCHS in CMakeLists.txt.
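A minimal source-build setup following the variables above might look like this; the toolkit path and architecture list are assumptions to adjust for your installation and target GPUs.

```shell
# Point the build at the CUDA toolkit (path is illustrative).
export CUDA_HOME=/usr/local/cuda-12.4
export PATH="$CUDA_HOME/bin:$PATH"

# Restrict compilation to specific SM targets to shorten build time,
# e.g. Ampere (8.0) and Hopper (9.0):
export TORCH_CUDA_ARCH_LIST="8.0;9.0"

# Build and install vLLM from a source checkout.
pip install -e . --no-build-isolation
```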
Requirements
| Requirement | Value |
|---|---|
| CUDA Toolkit | 12.x (12.4+ recommended) |
| nvcc Compiler | Included with CUDA toolkit |
| PTX ISA | Version 7.0+ (for SM 7.0+ targets) |
| Host Compiler | GCC >= 9 or Clang (compatible with nvcc) |
| CMake | >= 3.26.1 |
| GPU Architectures | SM 7.0 through SM 10.0 (Volta through Blackwell) |
| CUDA_HOME | Environment variable pointing to CUDA toolkit root |
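To show how the architecture range in the table maps to device code, here is a hedged nvcc invocation sketch (file names are illustrative; the flags are standard nvcc options). It emits native SASS for SM 8.0 and SM 9.0, plus embedded PTX for SM 9.0 so newer GPUs can JIT-compile the kernel at load time.

```shell
nvcc -c kernel.cu -o kernel.o \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_90,code=sm_90 \
  -gencode arch=compute_90,code=compute_90
```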