Environment: vLLM CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
NVIDIA CUDA GPU runtime environment required by vLLM for high-throughput LLM inference, providing GPU detection, compute capability gating, and optimized attention backends across Volta through Blackwell architectures.
Description
This environment defines the NVIDIA CUDA GPU stack that vLLM requires at runtime. vLLM auto-detects GPU hardware through NVML (pynvml) and falls back to a non-NVML code path on platforms such as Jetson where NVML is unavailable. The runtime enforces minimum compute capability thresholds for specific features: BFloat16 requires compute capability >= 8.0 (Ampere+), FP8 quantization requires >= 8.9 (Ada Lovelace+), and FlashAttention requires >= 8.0. cuDNN SDP is explicitly disabled at import time to prevent crashes on certain models (a PyTorch 2.5+ regression). For distributed inference, NCCL is the communication backend, and CUDA_VISIBLE_DEVICES plus LOCAL_RANK control device placement across multi-GPU setups.
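The capability thresholds above can be sketched as small predicates, mirroring vLLM's convention of encoding compute capability (major, minor) as major*10 + minor, so 8.9 becomes 89. The function names here are illustrative, not vLLM's actual API.

```python
# Illustrative sketch of vLLM's compute-capability gating.
# Thresholds match the text above; function names are hypothetical.

def capability_code(major: int, minor: int) -> int:
    """Encode (major, minor) as major*10 + minor, e.g. (8, 9) -> 89."""
    return major * 10 + minor

def supports_bfloat16(major: int, minor: int) -> bool:
    return capability_code(major, minor) >= 80   # Ampere (8.0) or newer

def supports_fp8(major: int, minor: int) -> bool:
    return capability_code(major, minor) >= 89   # Ada Lovelace (8.9) or newer

def supports_flash_attention(major: int, minor: int) -> bool:
    return capability_code(major, minor) >= 80   # Ampere (8.0) or newer

# A Turing T4 (7.5) fails all three checks; an H100 (9.0) passes them all.
```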
Usage
Use this environment for all vLLM inference workflows that target NVIDIA GPUs. This includes the offline LLM class, the online vllm serve API server, and the EngineArgs configuration path. Any implementation that calls into the vLLM engine on a CUDA device depends on this environment being satisfied.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | WSL partially supported but pin_memory is disabled (interface.py:452-460) |
| Python | >= 3.10, < 3.14 | Specified in pyproject.toml line 34 |
| Hardware | NVIDIA GPU, compute capability >= 7.0 | Volta (V100) minimum; see CUDA_SUPPORTED_ARCHS in CMakeLists.txt |
| Hardware (BFloat16) | Compute capability >= 8.0 | Ampere (A100) or newer required |
| Hardware (FP8) | Compute capability >= 8.9 | Ada Lovelace (L40, RTX 4090) or newer required |
| Hardware (FlashAttention) | Compute capability >= 8.0 | Ampere or newer required (cuda.py:399) |
| CUDA Toolkit | 12.9 (default) | VLLM_MAIN_CUDA_VERSION in vllm/envs.py line 75 |
| Build Tools | cmake >= 3.26.1, ninja | Required for C++/CUDA extension compilation (pyproject.toml) |
| VRAM | 16GB+ for 7B models | 24GB+ recommended; 80GB for 70B models |
| Disk | 50GB+ SSD | For model weights and KV cache |
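The VRAM figures can be sanity-checked with back-of-the-envelope arithmetic: FP16/BF16 weights take 2 bytes per parameter, so a 7B model needs roughly 13 GiB before any KV cache is allocated. A minimal sketch (illustrative arithmetic, not vLLM's memory accounting):

```python
def weight_memory_gib(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB for FP16/BF16 (2 bytes/param)."""
    return num_params_billion * 1e9 * bytes_per_param / 2**30

# A 7B model in FP16 needs ~13 GiB for weights alone, which is why
# 16GB VRAM is the floor and 24GB leaves headroom for the KV cache.
print(round(weight_memory_gib(7), 1))    # ~13.0
print(round(weight_memory_gib(70), 1))   # ~130.4 -> needs 80GB GPUs and tensor parallelism
```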
Dependencies
System Packages
- nvidia-driver >= 535 (for CUDA 12.x)
- cuda-toolkit 12.9 (matching VLLM_MAIN_CUDA_VERSION)
- nccl (distributed communication backend)
Python Packages (Core)
- torch == 2.9.1 (requirements/cuda.txt)
- flashinfer-python == 0.6.3 (attention backend for FlashInfer)
- numba == 0.61.2 (N-gram speculative decoding)
- pynvml (GPU detection via NVML; cuda.py:37)
Supported CUDA Architectures
From CMakeLists.txt (CUDA_SUPPORTED_ARCHS):
- 7.0 - Volta (V100)
- 7.2 - Jetson Xavier
- 7.5 - Turing (T4, RTX 2080)
- 8.0 - Ampere (A100) -- BFloat16 + FlashAttention enabled
- 8.6 - Ampere (RTX 3090, A40)
- 8.7 - Jetson Orin
- 8.9 - Ada Lovelace (L40, RTX 4090) -- FP8 enabled
- 9.0 - Hopper (H100, H200)
- 10.0 - Blackwell (B100, B200)
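The architecture list above can be summarized as a simple lookup; the mapping and the `describe` helper are illustrative, not part of vLLM:

```python
# Illustrative mapping of CUDA_SUPPORTED_ARCHS entries to architecture
# names and headline features, summarizing the list above.
ARCHS = {
    "7.0": ("Volta", "FP16 only"),
    "7.2": ("Jetson Xavier", "FP16 only"),
    "7.5": ("Turing", "FP16 only"),
    "8.0": ("Ampere", "BFloat16 + FlashAttention"),
    "8.6": ("Ampere", "BFloat16 + FlashAttention"),
    "8.7": ("Jetson Orin", "BFloat16 + FlashAttention"),
    "8.9": ("Ada Lovelace", "BFloat16 + FlashAttention + FP8"),
    "9.0": ("Hopper", "BFloat16 + FlashAttention + FP8"),
    "10.0": ("Blackwell", "BFloat16 + FlashAttention + FP8"),
}

def describe(arch: str) -> str:
    """Hypothetical helper: human-readable summary for one SM version."""
    name, features = ARCHS[arch]
    return f"SM {arch} ({name}): {features}"
```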
Credentials
No GPU-specific credentials are required. The following environment variables control runtime behavior (from vllm/envs.py):
| Variable | Default | Purpose |
|---|---|---|
| VLLM_TARGET_DEVICE | "cuda" | Selects the target device platform |
| CUDA_VISIBLE_DEVICES | (system default) | Restricts which GPUs are visible to the process |
| CUDA_HOME | (auto-detected) | Path to the CUDA toolkit installation |
| VLLM_NCCL_SO_PATH | (auto-detected) | Custom path to the NCCL shared library |
| LOCAL_RANK | 0 | Process rank for multi-GPU distributed inference |
| CUDA_DEVICE_ORDER | (system default) | Set to PCI_BUS_ID for mixed GPU setups (cuda.py:583-590) |
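A typical way to combine these variables for a run that should only use the first two GPUs on a mixed-GPU host (illustrative values):

```shell
# Force PCI bus ordering so device indices are stable across tools,
# then expose only the first two GPUs to the process (illustrative).
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1

# The process now sees the selected GPUs as devices 0 and 1.
echo "$CUDA_DEVICE_ORDER / $CUDA_VISIBLE_DEVICES"
```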
Quick Install
```bash
# Install vLLM with CUDA support
pip install vllm

# Verify CUDA availability and compute capability
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
cap = torch.cuda.get_device_capability(0)
print(f'Compute capability: {cap[0]}.{cap[1]}')
print(f'BFloat16 supported: {cap[0] >= 8}')
print(f'FP8 supported: {cap[0]*10 + cap[1] >= 89}')
"

# Verify NVML detection
python -c "import pynvml; pynvml.nvmlInit(); print('NVML OK')"
```
Code Evidence
BFloat16 compute capability check from vllm/platforms/cuda.py:443-460:
```python
if dtype == torch.bfloat16:
    if not cls.has_device_capability(80):
        raise ValueError(
            "Bfloat16 is only supported on GPUs "
            "with compute capability of at least 8.0. "
        )
```
FP8 support check from vllm/platforms/cuda.py:422-423:
```python
@classmethod
def supports_fp8(cls) -> bool:
    return cls.has_device_capability(89)
```
Platform detection via NVML from vllm/platforms/cuda.py:620-632:
```python
nvml_available = False
try:
    pynvml.nvmlInit()
    nvml_available = True
except Exception:
    nvml_available = False

CudaPlatform = NvmlCudaPlatform if nvml_available else NonNvmlCudaPlatform
```
cuDNN SDP disabled to prevent crashes from vllm/platforms/cuda.py:41:
```python
# pytorch 2.5 uses cudnn sdpa by default, which will cause crash on some models
torch.backends.cuda.enable_cudnn_sdp(False)
```
NCCL distributed backend from vllm/platforms/cuda.py:103:
```python
dist_backend = "nccl"
```
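Under this scheme, each worker process reads LOCAL_RANK and binds to the matching index among the devices left visible by CUDA_VISIBLE_DEVICES. A minimal sketch of that pattern (the helper is illustrative, not vLLM's launch code):

```python
import os

def local_device_index(default: int = 0) -> int:
    """Device index for this worker, taken from LOCAL_RANK (defaults to 0).

    Illustrative: with CUDA_VISIBLE_DEVICES=2,3 set by the launcher,
    LOCAL_RANK=0 binds to physical GPU 2 and LOCAL_RANK=1 to GPU 3.
    """
    return int(os.environ.get("LOCAL_RANK", default))

os.environ["LOCAL_RANK"] = "1"   # normally set by the distributed launcher
print(local_device_index())      # 1
```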
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0 | GPU compute capability < 8.0 (pre-Ampere) | Use --dtype=half (FP16) instead of BFloat16 |
| No valid attention backend found | GPU compute capability too low for the selected attention backend | Switch to a compatible backend; FlashAttention requires SM >= 8.0 |
| CUDA out of memory | Insufficient VRAM for the model and KV cache | Reduce --gpu-memory-utilization (default 0.9), use a smaller model, or enable tensor parallelism |
| Failed to import from vllm._C | CUDA toolkit version mismatch or incomplete build | Reinstall vLLM with a matching CUDA version; verify nvcc --version reports CUDA 12.9 |
| NVML initialization failure | NVML/pynvml not available (e.g., Jetson) | vLLM falls back to NonNvmlCudaPlatform automatically; no user action required |
Compatibility Notes
- Blackwell (SM 10.0): FlashInfer MLA is the preferred attention backend. Full support for all dtype modes.
- Hopper (SM 9.0): Flash Attention is the preferred backend. FP8 quantization supported.
- Ada Lovelace (SM 8.9): FP8 quantization supported. Flash Attention preferred.
- Ampere (SM 8.0-8.7): BFloat16 and Flash Attention supported. FP8 not available.
- Turing (SM 7.5): FP16 only. No BFloat16, no FP8, no FlashAttention.
- Volta (SM 7.0): FP16 only. Minimum supported architecture.
- Jetson Devices: NVML is not supported; vLLM automatically falls back to NonNvmlCudaPlatform (cuda.py:620-632).
- WSL (Windows Subsystem for Linux): pin_memory is disabled, which may reduce data loading performance (interface.py:452-460).
- Mixed GPU Setups: Set CUDA_DEVICE_ORDER=PCI_BUS_ID to ensure consistent device ordering (cuda.py:583-590).
- cuDNN SDP: Explicitly disabled at import time to avoid crashes introduced in PyTorch 2.5+ (cuda.py:41).
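The per-architecture backend preferences above can be condensed into a selection sketch; the backend names follow the notes, but the function itself is illustrative, not vLLM's backend resolver:

```python
def preferred_backend(major: int, minor: int) -> str:
    """Illustrative backend choice per the compatibility notes above."""
    code = major * 10 + minor
    if code >= 100:   # Blackwell (SM 10.0)
        return "FlashInfer MLA"
    if code >= 80:    # Ampere through Hopper (SM 8.0-9.0)
        return "FlashAttention"
    # Volta/Turing: FP16 only, no FlashAttention below SM 8.0
    return "FP16 fallback"
```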