Environment:Vllm project Vllm CUDA Hopper
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, NVIDIA_Hopper |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
NVIDIA Hopper (SM 9.0) GPU architecture environment for vLLM, targeting H100 and H200 accelerators with Hopper-specific features including the Tensor Memory Accelerator (TMA), native FP8 compute, and fourth-generation Tensor Cores.
Description
This environment defines the hardware and software requirements for running vLLM on NVIDIA Hopper-class GPUs (H100, H200). Hopper introduces several architectural features that vLLM exploits for inference throughput: the Tensor Memory Accelerator (TMA) for asynchronous bulk data movement between global and shared memory, native FP8 (E4M3/E5M2) arithmetic on Tensor Cores, and the Transformer Engine for mixed-precision attention and MLP computation. vLLM's FlashAttention backend reaches its peak performance on Hopper through warp-specialized kernels that overlap TMA loads with MMA (Matrix Multiply-Accumulate) instructions. FP8 quantized inference on Hopper can provide up to 2x the throughput of FP16/BF16 with minimal accuracy loss. The H200 variant provides 141 GB of HBM3e memory, compared with 80 GB of HBM3 on the H100, which allows larger KV caches and longer context lengths on a single GPU without resorting to tensor parallelism.
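To make the KV-cache benefit of the larger H200 memory concrete, the following is a rough back-of-the-envelope sketch, not vLLM's actual memory accounting. The model shape (32 layers, 8 KV heads, head dimension 128), the 16 GB weight footprint, and the 90% usable-memory fraction are all illustrative assumptions; the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(total_gb, weights_gb, per_token_bytes, usable_fraction=0.9):
    """Rough count of cacheable tokens after weights are loaded.

    Ignores activations, CUDA graphs, and allocator overhead (assumption).
    """
    usable = int(total_gb * 10**9 * usable_fraction) - weights_gb * 10**9
    return max(usable, 0) // per_token_bytes


# Hypothetical 8B-class model with grouped-query attention and a 2-byte (BF16) cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
h100_tokens = tokens_that_fit(total_gb=80, weights_gb=16, per_token_bytes=per_token)
h200_tokens = tokens_that_fit(total_gb=141, weights_gb=16, per_token_bytes=per_token)

print(f"KV bytes per token: {per_token}")
print(f"H100 (80 GB):  ~{h100_tokens} cacheable tokens")
print(f"H200 (141 GB): ~{h200_tokens} cacheable tokens")
```

Because the weight footprint is fixed, the extra H200 memory goes almost entirely to the KV cache, so under these assumptions the cacheable token count grows by more than the raw 141/80 capacity ratio.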
Usage
To target Hopper GPUs, install vLLM with CUDA 12.x support. vLLM detects Hopper hardware automatically through compute capability checks (SM >= 9.0) and enables Hopper-optimized code paths. For FP8 inference, pass the --quantization fp8 flag or load a pre-quantized FP8 model. Setting the VLLM_USE_DEEP_GEMM=1 environment variable enables DeepGEMM kernels optimized for Hopper's Tensor Cores. Batch invariance benchmarks verify that outputs remain consistent regardless of batch size on Hopper hardware.
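The steps above can be sketched in Python as follows. This is a minimal illustration, not vLLM's internal detection logic: `is_hopper`, `detect_capability`, and `build_fp8_launch` are hypothetical helper names, and the model identifier is a placeholder. The sketch matches compute capability 9.0 exactly, on the assumption that newer architectures would also satisfy a `>= 9.0` test. It only builds the launch command and environment; it does not start a server.

```python
import os
import shlex

HOPPER_CAPABILITY = (9, 0)  # H100 and H200 report compute capability 9.0


def is_hopper(capability):
    """Return True if a (major, minor) compute capability is exactly SM 9.0."""
    return tuple(capability) == HOPPER_CAPABILITY


def detect_capability():
    """Query GPU 0 through PyTorch; return None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


def build_fp8_launch(model, use_deep_gemm=True):
    """Build (argv, env) for an FP8 `vllm serve` launch without running it."""
    env = dict(os.environ)
    if use_deep_gemm:
        env["VLLM_USE_DEEP_GEMM"] = "1"
    argv = ["vllm", "serve", model, "--quantization", "fp8"]
    return argv, env


if __name__ == "__main__":
    cap = detect_capability()
    print("capability:", cap, "| hopper:", cap is not None and is_hopper(cap))
    argv, env = build_fp8_launch("my-org/my-model")  # placeholder model id
    print("VLLM_USE_DEEP_GEMM=" + env["VLLM_USE_DEEP_GEMM"], shlex.join(argv))
```

The returned `argv` and `env` could be passed to `subprocess.run(argv, env=env)` on a machine where vLLM is installed.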
Requirements
| Requirement | Value |
|---|---|
| GPU | NVIDIA H100, H200, or compatible Hopper-class GPU |
| Compute Capability | SM 9.0+ |
| CUDA Toolkit | 12.x (12.4+ recommended) |
| GPU Memory | 80 GB HBM3 (H100) or 141 GB HBM3e (H200) |
| NVLink | NVLink 4.0 (900 GB/s bidirectional) for multi-GPU |
| Driver | NVIDIA driver >= 535 |
| Hopper Features | TMA, FP8 native, 4th-gen Tensor Cores, Thread Block Clusters |
| Python Package | torch >= 2.9.1 with CUDA 12 support |
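As a rough illustration, the version thresholds in the table could be checked with a small preflight helper like the one below. It is a sketch under stated assumptions: `check_requirements` and `parse_version` are hypothetical names, the caller supplies the version strings (for example from `nvidia-smi` or `torch.__version__`), and only the leading numeric part of each version is compared.

```python
import re


def parse_version(text):
    """Extract the leading dotted-number part, e.g. '2.9.1+cu128' -> (2, 9, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_requirements(capability, cuda_version, driver_version, torch_version):
    """Return a list of problems; an empty list means every table threshold is met."""
    problems = []
    if tuple(capability) < (9, 0):
        problems.append(f"compute capability {capability} is below SM 9.0")
    if parse_version(cuda_version)[0] != 12:
        problems.append(f"CUDA {cuda_version} is not a 12.x toolkit")
    if parse_version(driver_version)[0] < 535:
        problems.append(f"driver {driver_version} is older than 535")
    if parse_version(torch_version) < (2, 9, 1):
        problems.append(f"torch {torch_version} is older than 2.9.1")
    return problems


# Example inputs are made up for illustration.
for problem in check_requirements((8, 0), "11.8", "525.60.13", "2.8.0"):
    print("unmet:", problem)
```

Returning a list of problems rather than raising on the first failure lets a single run report every unmet requirement at once.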