

Environment: vLLM CUDA GPU Runtime (vllm-project/vllm)

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-08 00:00 GMT

Overview

NVIDIA CUDA GPU runtime environment required by vLLM for high-throughput LLM inference, providing GPU detection, compute capability gating, and optimized attention backends across Volta through Blackwell architectures.

Description

This environment defines the NVIDIA CUDA GPU stack that vLLM requires at runtime. vLLM auto-detects GPU hardware through NVML (pynvml) and falls back to a non-NVML code path on platforms such as Jetson where NVML is unavailable. The runtime enforces minimum compute capability thresholds for specific features: BFloat16 requires compute capability >= 8.0 (Ampere+), FP8 quantization requires >= 8.9 (Ada Lovelace+), and FlashAttention requires >= 8.0. cuDNN SDP is explicitly disabled at import time to prevent crashes on certain models (a PyTorch 2.5+ regression). For distributed inference, NCCL is the communication backend, and CUDA_VISIBLE_DEVICES plus LOCAL_RANK control device placement across multi-GPU setups.
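The NVML-first detection with a non-NVML fallback described above can be sketched as follows. This is an illustrative reduction, not vLLM's actual code; the real selection lives in vllm/platforms/cuda.py:

```python
def detect_cuda_platform() -> str:
    """Mirror vLLM's platform choice: NVML if it initializes, else fallback.

    Returns the name of the platform class that would be selected
    (illustrative only; vLLM assigns real classes, not strings).
    """
    try:
        import pynvml  # NVML bindings used by vLLM for GPU detection
        pynvml.nvmlInit()
        return "NvmlCudaPlatform"
    except Exception:
        # NVML unavailable (e.g. on Jetson): fall back to the non-NVML path
        return "NonNvmlCudaPlatform"
```

On a Jetson board this returns the non-NVML path without any user configuration, matching the behavior described above.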

Usage

Use this environment for all vLLM inference workflows that target NVIDIA GPUs. This includes the offline LLM class, the online vllm serve API server, and the EngineArgs configuration path. Any Implementation that calls into the vLLM engine on a CUDA device depends on this environment being satisfied.
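For reference, a minimal offline-inference sketch using the LLM class on this environment. The model name is just an example, and the import guard is only so the sketch degrades gracefully where vLLM is absent; running it for real requires a CUDA GPU meeting the requirements below:

```python
# Minimal offline-inference sketch; requires vLLM on a CUDA machine.
try:
    from vllm import LLM, SamplingParams
except ImportError:
    LLM = SamplingParams = None  # vLLM not installed in this environment

def run_offline_demo():
    """Generate a short completion with the offline LLM class (illustrative)."""
    if LLM is None:
        return None
    llm = LLM(model="facebook/opt-125m")  # example model small enough for most GPUs
    params = SamplingParams(temperature=0.8, max_tokens=32)
    outputs = llm.generate(["Hello, my name is"], params)
    return [out.outputs[0].text for out in outputs]
```

The online path (`vllm serve <model>`) and the offline LLM class both flow through the same EngineArgs configuration mentioned above.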

System Requirements

Category | Requirement | Notes
OS | Linux (Ubuntu 20.04+ recommended) | WSL partially supported, but pin_memory is disabled (interface.py:452-460)
Python | >= 3.10, < 3.14 | Specified in pyproject.toml line 34
Hardware | NVIDIA GPU, compute capability >= 7.0 | Volta (V100) minimum; see CUDA_SUPPORTED_ARCHS in CMakeLists.txt
Hardware (BFloat16) | Compute capability >= 8.0 | Ampere (A100) or newer required
Hardware (FP8) | Compute capability >= 8.9 | Ada Lovelace (L40, RTX 4090) or newer required
Hardware (FlashAttention) | Compute capability >= 8.0 | Ampere or newer required (cuda.py:399)
CUDA Toolkit | 12.9 (default) | VLLM_MAIN_CUDA_VERSION in vllm/envs.py line 75
Build Tools | cmake >= 3.26.1, ninja | Required for C++/CUDA extension compilation (pyproject.toml)
VRAM | 16 GB+ for 7B models | 24 GB+ recommended; 80 GB for 70B models
Disk | 50 GB+ SSD | For model weights and KV cache
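The VRAM figures above follow from simple arithmetic on weight size (2 bytes per parameter for FP16/BF16), before any KV-cache or activation overhead. A back-of-the-envelope sketch:

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Model weight footprint in decimal GB (2 bytes/param for FP16/BF16)."""
    return num_params * bytes_per_param / 1e9

# A 7B model in FP16 needs ~14 GB for weights alone, before the KV cache,
# which is why the table asks for 16 GB+ and recommends 24 GB.
assert round(weight_gb(7e9)) == 14

# A 70B model needs ~140 GB for weights; hence 80 GB cards plus tensor
# parallelism (or quantization) for the largest models.
assert round(weight_gb(70e9)) == 140
```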

Dependencies

System Packages

  • nvidia-driver >= 535 (for CUDA 12.x)
  • cuda-toolkit 12.9 (matching VLLM_MAIN_CUDA_VERSION)
  • nccl (distributed communication backend)

Python Packages (Core)

  • torch == 2.9.1 (requirements/cuda.txt)
  • flashinfer-python == 0.6.3 (attention backend for FlashInfer)
  • numba == 0.61.2 (N-gram speculative decoding)
  • pynvml (GPU detection via NVML; cuda.py:37)
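A quick way to verify the core Python dependencies are present without importing them is to probe the module finder. The importable module names below are assumptions for the listed packages (notably, `flashinfer-python` is assumed to import as `flashinfer`):

```python
import importlib.util

def check_core_deps() -> dict:
    """Report which of vLLM's core Python dependencies are importable."""
    deps = ["torch", "flashinfer", "numba", "pynvml"]
    return {name: importlib.util.find_spec(name) is not None for name in deps}

missing = [name for name, present in check_core_deps().items() if not present]
if missing:
    print(f"Missing packages: {', '.join(missing)}")
```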

Supported CUDA Architectures

From CMakeLists.txt (CUDA_SUPPORTED_ARCHS):

  • 7.0 - Volta (V100)
  • 7.2 - Jetson Xavier
  • 7.5 - Turing (T4, RTX 2080)
  • 8.0 - Ampere (A100) -- BFloat16 + FlashAttention enabled
  • 8.6 - Ampere (RTX 3090, A40)
  • 8.7 - Jetson Orin
  • 8.9 - Ada Lovelace (L40, RTX 4090) -- FP8 enabled
  • 9.0 - Hopper (H100, H200)
  • 10.0 - Blackwell (B100, B200)
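The feature gates that accompany this architecture list reduce to three compute-capability thresholds; a minimal sketch of that gating (the real checks live in vllm/platforms/cuda.py):

```python
def cuda_features(major: int, minor: int) -> dict:
    """Feature availability by compute capability, per the thresholds above."""
    cap = major * 10 + minor
    return {
        "bfloat16": cap >= 80,         # Ampere (8.0) or newer
        "fp8": cap >= 89,              # Ada Lovelace (8.9) or newer
        "flash_attention": cap >= 80,  # Ampere (8.0) or newer
    }

# Ampere A100: BF16 and FlashAttention, but no FP8.
assert cuda_features(8, 0) == {"bfloat16": True, "fp8": False, "flash_attention": True}
# Ada Lovelace and above also get FP8.
assert cuda_features(8, 9)["fp8"] is True
# Turing: FP16 only.
assert not any(cuda_features(7, 5).values())
```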

Credentials

No GPU-specific credentials are required. The following environment variables control runtime behavior (from vllm/envs.py):

Variable | Default | Purpose
VLLM_TARGET_DEVICE | "cuda" | Selects the target device platform
CUDA_VISIBLE_DEVICES | (system default) | Restricts which GPUs are visible to the process
CUDA_HOME | (auto-detected) | Path to the CUDA toolkit installation
VLLM_NCCL_SO_PATH | (auto-detected) | Custom path to the NCCL shared library
LOCAL_RANK | 0 | Process rank for multi-GPU distributed inference
CUDA_DEVICE_ORDER | (system default) | Set to PCI_BUS_ID for mixed GPU setups (cuda.py:583-590)
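For example, pinning a process to two specific GPUs with stable device ordering. The values are illustrative; the key constraint is that these variables must be set before CUDA is initialized, i.e., before importing torch or vLLM:

```python
import os

# Must be set before CUDA is initialized (before importing torch/vLLM).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # stable ordering on mixed-GPU hosts
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"      # expose only GPUs 0 and 1
os.environ.setdefault("LOCAL_RANK", "0")        # this process's rank on the node
```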

Quick Install

# Install vLLM with CUDA support
pip install vllm

# Verify CUDA availability and compute capability
python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU: {torch.cuda.get_device_name(0)}')
cap = torch.cuda.get_device_capability(0)
print(f'Compute capability: {cap[0]}.{cap[1]}')
print(f'BFloat16 supported: {cap[0] >= 8}')
print(f'FP8 supported: {cap[0]*10 + cap[1] >= 89}')
"

# Verify NVML detection
python -c "import pynvml; pynvml.nvmlInit(); print('NVML OK')"

Code Evidence

BFloat16 compute capability check from vllm/platforms/cuda.py:443-460:

if dtype == torch.bfloat16:
    if not cls.has_device_capability(80):
        raise ValueError(
            "Bfloat16 is only supported on GPUs "
            "with compute capability of at least 8.0. "
        )

FP8 support check from vllm/platforms/cuda.py:422-423:

@classmethod
def supports_fp8(cls) -> bool:
    return cls.has_device_capability(89)

Platform detection via NVML from vllm/platforms/cuda.py:620-632:

nvml_available = False
try:
    pynvml.nvmlInit()
    nvml_available = True
except Exception:
    nvml_available = False
CudaPlatform = NvmlCudaPlatform if nvml_available else NonNvmlCudaPlatform

cuDNN SDP disabled to prevent crashes from vllm/platforms/cuda.py:41:

# pytorch 2.5 uses cudnn sdpa by default, which will cause crash on some models
torch.backends.cuda.enable_cudnn_sdp(False)

NCCL distributed backend from vllm/platforms/cuda.py:103:

dist_backend = "nccl"

Common Errors

Error Message | Cause | Solution
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0 | GPU compute capability < 8.0 (pre-Ampere) | Use --dtype=half (FP16) instead of BFloat16
No valid attention backend found | GPU compute capability too low for the selected attention backend | Switch to a compatible backend; FlashAttention requires SM >= 8.0
CUDA out of memory | Insufficient VRAM for the model and KV cache | Reduce --gpu-memory-utilization (default 0.9), use a smaller model, or enable tensor parallelism
Failed to import from vllm._C | CUDA toolkit version mismatch or incomplete build | Reinstall vLLM with a matching CUDA version; verify nvcc --version matches the expected CUDA 12.9
NVML initialization failure | NVML/pynvml not available (e.g., Jetson) | vLLM falls back to NonNvmlCudaPlatform automatically; no user action required
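The out-of-memory advice above (lower --gpu-memory-utilization step by step) can be automated. This is a hypothetical retry helper, not part of vLLM; `try_load` stands in for whatever loads the engine at a given utilization fraction and raises a RuntimeError on OOM:

```python
def load_with_backoff(try_load, start=0.9, floor=0.5, step=0.1):
    """Retry a loader with progressively lower gpu_memory_utilization.

    `try_load` is a hypothetical callback taking a utilization fraction
    and raising RuntimeError (CUDA OOM-style) when the model cannot fit.
    Returns (result, utilization) on the first success.
    """
    util = start
    while util >= floor:
        u = round(util, 2)  # guard against float drift from repeated subtraction
        try:
            return try_load(u), u
        except RuntimeError:
            util -= step  # mirror the table's advice: lower utilization and retry
    raise RuntimeError("could not fit model; try a smaller model or tensor parallelism")

def fake_load(u):
    """Simulated loader that only fits at <= 0.7 utilization."""
    if u > 0.7:
        raise RuntimeError("CUDA out of memory")
    return "engine"

result, util = load_with_backoff(fake_load)
assert (result, util) == ("engine", 0.7)
```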

Compatibility Notes

  • Blackwell (SM 10.0): FlashInfer MLA is the preferred attention backend. Full support for all dtype modes.
  • Hopper (SM 9.0): Flash Attention is the preferred backend. FP8 quantization supported.
  • Ada Lovelace (SM 8.9): FP8 quantization supported. Flash Attention preferred.
  • Ampere (SM 8.0-8.7): BFloat16 and Flash Attention supported. FP8 not available.
  • Turing (SM 7.5): FP16 only. No BFloat16, no FP8, no FlashAttention.
  • Volta (SM 7.0): FP16 only. Minimum supported architecture.
  • Jetson Devices: NVML is not supported; vLLM automatically falls back to NonNvmlCudaPlatform (cuda.py:620-632).
  • WSL (Windows Subsystem for Linux): pin_memory is disabled, which may reduce data loading performance (interface.py:452-460).
  • Mixed GPU Setups: Set CUDA_DEVICE_ORDER=PCI_BUS_ID to ensure consistent device ordering (cuda.py:583-590).
  • cuDNN SDP: Explicitly disabled at import time to avoid crashes introduced in PyTorch 2.5+ (cuda.py:41).
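The backend preferences in these notes can be summarized as a small lookup by SM version. This is an illustrative condensation of the notes above; vLLM's actual backend selection considers more factors (head size, dtype, model architecture):

```python
def preferred_backend(major: int, minor: int) -> str:
    """Preferred attention backend by SM version, per the notes above."""
    cap = major * 10 + minor
    if cap >= 100:
        return "FlashInfer MLA"  # Blackwell
    if cap >= 80:
        return "FlashAttention"  # Ampere / Ada Lovelace / Hopper
    return "non-Flash fallback (FP16 only)"  # Volta / Turing

assert preferred_backend(10, 0) == "FlashInfer MLA"
assert preferred_backend(9, 0) == "FlashAttention"
assert preferred_backend(7, 5).startswith("non-Flash")
```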
