

Environment: vLLM CUDA GPU Runtime (vllm-project/vllm)

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-08 00:00 GMT

Overview

NVIDIA CUDA GPU runtime environment required by vLLM for high-throughput LLM inference, providing GPU detection, compute capability gating, and optimized attention backends across Volta through Blackwell architectures.

Description

This environment defines the NVIDIA CUDA GPU stack that vLLM requires at runtime. vLLM auto-detects GPU hardware through NVML (pynvml) and falls back to a non-NVML code path on platforms such as Jetson where NVML is unavailable. The runtime enforces minimum compute capability thresholds for specific features: BFloat16 requires compute capability >= 8.0 (Ampere+), FP8 quantization requires >= 8.9 (Ada Lovelace+), and FlashAttention requires >= 8.0. cuDNN SDP is explicitly disabled at import time to prevent crashes on certain models (a PyTorch 2.5+ regression). For distributed inference, NCCL is the communication backend, and CUDA_VISIBLE_DEVICES plus LOCAL_RANK control device placement across multi-GPU setups.
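The NVML-first detection with a non-NVML fallback described above can be sketched as follows. This is an illustrative reduction, not vLLM's actual code; the real selection lives in vllm/platforms/cuda.py:

```python
def detect_cuda_platform() -> str:
    """Mirror vLLM's platform choice: NVML if it initializes, else fallback.

    Returns the name of the platform class that would be selected
    (illustrative only; vLLM assigns real classes, not strings).
    """
    try:
        import pynvml  # NVML bindings used by vLLM for GPU detection
        pynvml.nvmlInit()
        return "NvmlCudaPlatform"
    except Exception:
        # NVML unavailable (e.g. on Jetson): fall back to the non-NVML path
        return "NonNvmlCudaPlatform"
```

On a Jetson board this returns the non-NVML path without any user configuration, matching the behavior described above.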

Usage

Use this environment for all vLLM inference workflows that target NVIDIA GPUs. This includes the offline LLM class, the online vllm serve API server, and the EngineArgs configuration path. Any Implementation that calls into the vLLM engine on a CUDA device depends on this environment being satisfied.
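For reference, a minimal offline-inference sketch using the LLM class on this environment. The model name is just an example, and the import guard is only so the sketch degrades gracefully where vLLM is absent; running it for real requires a CUDA GPU meeting the requirements below:

```python
# Minimal offline-inference sketch; requires vLLM on a CUDA machine.
try:
    from vllm import LLM, SamplingParams
except ImportError:
    LLM = SamplingParams = None  # vLLM not installed in this environment

def run_offline_demo():
    """Generate a short completion with the offline LLM class (illustrative)."""
    if LLM is None:
        return None
    llm = LLM(model="facebook/opt-125m")  # example model small enough for most GPUs
    params = SamplingParams(temperature=0.8, max_tokens=32)
    outputs = llm.generate(["Hello, my name is"], params)
    return [out.outputs[0].text for out in outputs]
```

The online path (`vllm serve <model>`) and the offline LLM class both flow through the same EngineArgs configuration mentioned above.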

System Requirements

Category | Requirement | Notes
OS | Linux (Ubuntu 20.04+ recommended) | WSL partially supported, but pin_memory is disabled (interface.py:452-460)
Python | >= 3.10, < 3.14 | Specified in pyproject.toml line 34
Hardware | NVIDIA GPU, compute capability >= 7.0 | Volta (V100) minimum; see CUDA_SUPPORTED_ARCHS in CMakeLists.txt
Hardware (BFloat16) | Compute capability >= 8.0 | Ampere (A100) or newer required
Hardware (FP8) | Compute capability >= 8.9 | Ada Lovelace (L40, RTX 4090) or newer required
Hardware (FlashAttention) | Compute capability >= 8.0 | Ampere or newer required (cuda.py:399)
CUDA Toolkit | 12.9 (default) | VLLM_MAIN_CUDA_VERSION in vllm/envs.py line 75
Build Tools | cmake >= 3.26.1, ninja | Required for C++/CUDA extension compilation (pyproject.toml)
VRAM | 16 GB+ for 7B models | 24 GB+ recommended; 80 GB for 70B models
Disk | 50 GB+ SSD | For model weights and KV cache
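The VRAM figures above follow from simple arithmetic on weight size (2 bytes per parameter for FP16/BF16), before any KV-cache or activation overhead. A back-of-the-envelope sketch:

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Model weight footprint in decimal GB (2 bytes/param for FP16/BF16)."""
    return num_params * bytes_per_param / 1e9

# A 7B model in FP16 needs ~14 GB for weights alone, before the KV cache,
# which is why the table asks for 16 GB+ and recommends 24 GB.
assert round(weight_gb(7e9)) == 14

# A 70B model needs ~140 GB for weights; hence 80 GB cards plus tensor
# parallelism (or quantization) for the largest models.
assert round(weight_gb(70e9)) == 140
```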

Dependencies

System Packages

  • nvidia-driver >= 535 (for CUDA 12.x)
  • cuda-toolkit 12.9 (matching VLLM_MAIN_CUDA_VERSION)
  • nccl (distributed communication backend)

Python Packages (Core)

  • torch == 2.9.1 (requirements/cuda.txt)
  • flashinfer-python == 0.6.3 (attention backend for FlashInfer)
  • numba == 0.61.2 (N-gram speculative decoding)
  • pynvml (GPU detection via NVML; cuda.py:37)
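A quick way to verify the core Python dependencies are present without importing them is to probe the module finder. The importable module names below are assumptions for the listed packages (notably, `flashinfer-python` is assumed to import as `flashinfer`):

```python
import importlib.util

def check_core_deps() -> dict:
    """Report which of vLLM's core Python dependencies are importable."""
    deps = ["torch", "flashinfer", "numba", "pynvml"]
    return {name: importlib.util.find_spec(name) is not None for name in deps}

missing = [name for name, present in check_core_deps().items() if not present]
if missing:
    print(f"Missing packages: {', '.join(missing)}")
```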

Supported CUDA Architectures

From CMakeLists.txt (CUDA_SUPPORTED_ARCHS):

  • 7.0 - Volta (V100)
  • 7.2 - Jetson Xavier
  • 7.5 - Turing (T4, RTX 2080)
  • 8.0 - Ampere (A100) -- BFloat16 + FlashAttention enabled
  • 8.6 - Ampere (RTX 3090, A40)
  • 8.7 - Jetson Orin
  • 8.9 - Ada Lovelace (L40, RTX 4090) -- FP8 enabled
  • 9.0 - Hopper (H100, H200)
  • 10.0 - Blackwell (B100, B200)
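The feature gates that accompany this architecture list reduce to three compute-capability thresholds; a minimal sketch of that gating (the real checks live in vllm/platforms/cuda.py):

```python
def cuda_features(major: int, minor: int) -> dict:
    """Feature availability by compute capability, per the thresholds above."""
    cap = major * 10 + minor
    return {
        "bfloat16": cap >= 80,         # Ampere (8.0) or newer
        "fp8": cap >= 89,              # Ada Lovelace (8.9) or newer
        "flash_attention": cap >= 80,  # Ampere (8.0) or newer
    }

# Ampere A100: BF16 and FlashAttention, but no FP8.
assert cuda_features(8, 0) == {"bfloat16": True, "fp8": False, "flash_attention": True}
# Ada Lovelace and above also get FP8.
assert cuda_features(8, 9)["fp8"] is True
# Turing: FP16 only.
assert not any(cuda_features(7, 5).values())
```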

Credentials

No GPU-specific credentials are required. The following environment variables control runtime behavior (from vllm/envs.py):

Variable | Default | Purpose
VLLM_TARGET_DEVICE | "cuda" | Selects the target device platform
CUDA_VISIBLE_DEVICES | (system default) | Restricts which GPUs are visible to the process
CUDA_HOME | (auto-detected) | Path to the CUDA toolkit installation
VLLM_NCCL_SO_PATH | (auto-detected) | Custom path to the NCCL shared library
LOCAL_RANK | 0 | Process rank for multi-GPU distributed inference
CUDA_DEVICE_ORDER | (system default) | Set to PCI_BUS_ID for mixed GPU setups (cuda.py:583-590)
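For example, pinning a process to two specific GPUs with stable device ordering. The values are illustrative; the key constraint is that these variables must be set before CUDA is initialized, i.e., before importing torch or vLLM:

```python
import os

# Must be set before CUDA is initialized (before importing torch/vLLM).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # stable ordering on mixed-GPU hosts
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"      # expose only GPUs 0 and 1
os.environ.setdefault("LOCAL_RANK", "0")        # this process's rank on the node
```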

Quick Install

# Install vLLM with CUDA support
pip install vllm

# Verify CUDA availability and compute capability
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
cap = torch.cuda.get_device_capability(0)
print(f'Compute capability: {cap[0]}.{cap[1]}')
print(f'BFloat16 supported: {cap[0] >= 8}')
print(f'FP8 supported: {cap[0]*10 + cap[1] >= 89}')
"

# Verify NVML detection
python -c "import pynvml; pynvml.nvmlInit(); print('NVML OK')"

Code Evidence

BFloat16 compute capability check from vllm/platforms/cuda.py:443-460:

if dtype == torch.bfloat16:
    if not cls.has_device_capability(80):
        raise ValueError(
            "Bfloat16 is only supported on GPUs "
            "with compute capability of at least 8.0. "
        )

FP8 support check from vllm/platforms/cuda.py:422-423:

@classmethod
def supports_fp8(cls) -> bool:
    return cls.has_device_capability(89)

Platform detection via NVML from vllm/platforms/cuda.py:620-632:

nvml_available = False
try:
    pynvml.nvmlInit()
    nvml_available = True
except Exception:
    nvml_available = False
CudaPlatform = NvmlCudaPlatform if nvml_available else NonNvmlCudaPlatform

cuDNN SDP disabled to prevent crashes from vllm/platforms/cuda.py:41:

# pytorch 2.5 uses cudnn sdpa by default, which will cause crash on some models
torch.backends.cuda.enable_cudnn_sdp(False)

NCCL distributed backend from vllm/platforms/cuda.py:103:

dist_backend = "nccl"

Common Errors

Error Message | Cause | Solution
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0 | GPU compute capability < 8.0 (pre-Ampere) | Use --dtype=half (FP16) instead of BFloat16
No valid attention backend found | GPU compute capability too low for the selected attention backend | Switch to a compatible backend; FlashAttention requires SM >= 8.0
CUDA out of memory | Insufficient VRAM for the model and KV cache | Reduce --gpu-memory-utilization (default 0.9), use a smaller model, or enable tensor parallelism
Failed to import from vllm._C | CUDA toolkit version mismatch or incomplete build | Reinstall vLLM with a matching CUDA version; verify nvcc --version matches the expected CUDA 12.9
NVML initialization failure | NVML/pynvml not available (e.g., Jetson) | vLLM falls back to NonNvmlCudaPlatform automatically; no user action required
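The out-of-memory advice above (lower --gpu-memory-utilization step by step) can be automated. This is a hypothetical retry helper, not part of vLLM; `try_load` stands in for whatever loads the engine at a given utilization fraction and raises a RuntimeError on OOM:

```python
def load_with_backoff(try_load, start=0.9, floor=0.5, step=0.1):
    """Retry a loader with progressively lower gpu_memory_utilization.

    `try_load` is a hypothetical callback taking a utilization fraction
    and raising RuntimeError (CUDA OOM-style) when the model cannot fit.
    Returns (result, utilization) on the first success.
    """
    util = start
    while util >= floor:
        u = round(util, 2)  # guard against float drift from repeated subtraction
        try:
            return try_load(u), u
        except RuntimeError:
            util -= step  # mirror the table's advice: lower utilization and retry
    raise RuntimeError("could not fit model; try a smaller model or tensor parallelism")

def fake_load(u):
    """Simulated loader that only fits at <= 0.7 utilization."""
    if u > 0.7:
        raise RuntimeError("CUDA out of memory")
    return "engine"

result, util = load_with_backoff(fake_load)
assert (result, util) == ("engine", 0.7)
```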

Compatibility Notes

  • Blackwell (SM 10.0): FlashInfer MLA is the preferred attention backend. Full support for all dtype modes.
  • Hopper (SM 9.0): Flash Attention is the preferred backend. FP8 quantization supported.
  • Ada Lovelace (SM 8.9): FP8 quantization supported. Flash Attention preferred.
  • Ampere (SM 8.0-8.7): BFloat16 and Flash Attention supported. FP8 not available.
  • Turing (SM 7.5): FP16 only. No BFloat16, no FP8, no FlashAttention.
  • Volta (SM 7.0): FP16 only. Minimum supported architecture.
  • Jetson Devices: NVML is not supported; vLLM automatically falls back to NonNvmlCudaPlatform (cuda.py:620-632).
  • WSL (Windows Subsystem for Linux): pin_memory is disabled, which may reduce data loading performance (interface.py:452-460).
  • Mixed GPU Setups: Set CUDA_DEVICE_ORDER=PCI_BUS_ID to ensure consistent device ordering (cuda.py:583-590).
  • cuDNN SDP: Explicitly disabled at import time to avoid crashes introduced in PyTorch 2.5+ (cuda.py:41).
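The backend preferences in these notes can be summarized as a small lookup by SM version. This is an illustrative condensation of the notes above; vLLM's actual backend selection considers more factors (head size, dtype, model architecture):

```python
def preferred_backend(major: int, minor: int) -> str:
    """Preferred attention backend by SM version, per the notes above."""
    cap = major * 10 + minor
    if cap >= 100:
        return "FlashInfer MLA"  # Blackwell
    if cap >= 80:
        return "FlashAttention"  # Ampere / Ada Lovelace / Hopper
    return "non-Flash fallback (FP16 only)"  # Volta / Turing

assert preferred_backend(10, 0) == "FlashInfer MLA"
assert preferred_backend(9, 0) == "FlashAttention"
assert preferred_backend(7, 5).startswith("non-Flash")
```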
