Environment:Vllm project Vllm CUDA Hopper

Knowledge Sources	vllm, NVIDIA Hopper Architecture
Domains	GPU_Computing, NVIDIA_Hopper
Last Updated	2026-02-08 00:00 GMT

Overview

NVIDIA Hopper (SM 9.0) GPU architecture environment for vLLM, targeting H100 and H200 accelerators with Hopper-specific features including the Tensor Memory Accelerator (TMA), native FP8 compute, and fourth-generation Tensor Cores.

Description

This environment defines the hardware and software requirements for running vLLM on NVIDIA Hopper-class GPUs (H100, H200). Hopper introduces several architectural features that vLLM uses to raise inference throughput: the Tensor Memory Accelerator (TMA) for asynchronous bulk data movement between global and shared memory, native FP8 (E4M3/E5M2) arithmetic on Tensor Cores, and the Transformer Engine for mixed-precision attention and MLP computation. vLLM's FlashAttention backend performs best on Hopper because its warp-specialized kernels overlap TMA loads with MMA (Matrix Multiply-Accumulate) instructions, so data movement is hidden behind compute. FP8 quantized inference on Hopper can deliver up to 2x the throughput of FP16/BF16, typically with minimal accuracy loss; the actual gain depends on the model and workload. The H200 variant provides 141 GB of HBM3e memory, which allows larger KV caches and longer context lengths on a single GPU and reduces the need for tensor parallelism.
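To make the KV-cache point concrete, the following is a rough sizing sketch, not a vLLM API. The function names, the model dimensions (32 layers, 8 KV heads, head size 128), the 16 GB weight footprint, and the 0.9 memory-utilization fraction are all illustrative assumptions. The only fixed relationship is that each cached token stores one key and one value vector per layer and per KV head.

```python
GB = 10**9  # decimal gigabytes, matching the 80 GB / 141 GB figures


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache needed per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(gpu_mem_gb, weights_gb, bytes_per_token, utilization=0.9):
    """Tokens that fit in the memory left after weights (assumed budget model)."""
    budget = int(gpu_mem_gb * GB * utilization) - int(weights_gb * GB)
    return max(budget, 0) // bytes_per_token


# Hypothetical model shape; 2 bytes per element for FP16/BF16, 1 for FP8.
per_token_bf16 = kv_bytes_per_token(32, 8, 128, 2)
per_token_fp8 = kv_bytes_per_token(32, 8, 128, 1)

for name, mem_gb in (("H100", 80), ("H200", 141)):
    tokens = max_cached_tokens(mem_gb, 16, per_token_bf16)
    print(f"{name}: about {tokens} cacheable tokens with a BF16 KV cache")
```

Under these assumptions, the extra H200 memory goes almost entirely to the KV cache, and storing the cache in FP8 halves the per-token cost again. Real capacity also depends on activation memory and allocator overhead, which this sketch ignores.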

Usage

To target Hopper GPUs, install vLLM with CUDA 12.x support. vLLM detects Hopper hardware automatically through compute capability checks (SM >= 9.0) and enables Hopper-optimized code paths. For FP8 inference, pass the --quantization fp8 flag or load a pre-quantized FP8 model. Setting the VLLM_USE_DEEP_GEMM=1 environment variable enables DeepGEMM kernels optimized for Hopper's Tensor Cores. The batch invariance benchmark checks that model outputs stay consistent regardless of batch size on Hopper hardware.
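The steps above can be sketched as a small launcher helper. This is a minimal illustration, not vLLM's own detection logic. `meets_hopper_minimum`, `detect_capability`, `hopper_env`, and `build_serve_command` are hypothetical names, and the model ID is a placeholder. The real pieces are `torch.cuda.get_device_capability()`, the `vllm serve` command with `--quantization fp8`, and the `VLLM_USE_DEEP_GEMM` variable named in the text.

```python
import os


def meets_hopper_minimum(capability):
    """True when a (major, minor) compute capability is SM 9.0 or newer."""
    return tuple(capability) >= (9, 0)


def detect_capability():
    """Return GPU 0's (major, minor) capability, or None without CUDA/PyTorch."""
    try:
        import torch  # optional: only needed for live detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


def hopper_env(base_env=None):
    """Copy an environment mapping and opt in to DeepGEMM kernels."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_USE_DEEP_GEMM"] = "1"
    return env


def build_serve_command(model):
    """Command line for serving a model with FP8 quantization."""
    return ["vllm", "serve", model, "--quantization", "fp8"]


capability = detect_capability()
if capability is not None and meets_hopper_minimum(capability):
    # Placeholder model ID; substitute a real checkpoint before running.
    print("Would run:", " ".join(build_serve_command("my-org/my-model")))
else:
    print("No SM 9.0+ GPU detected; skipping the Hopper-specific launch.")
```

To actually start the server, the command and environment could be passed to `subprocess.run(build_serve_command(model), env=hopper_env())`. The sketch only prints the command so it stays safe to run on machines without a GPU.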

Requirements

Requirement	Value
GPU	NVIDIA H100, H200, or compatible Hopper-class GPU
Compute Capability	SM 9.0+
CUDA Toolkit	12.x (12.4+ recommended)
GPU Memory	80 GB HBM3 (H100) or 141 GB HBM3e (H200)
NVLink	NVLink 4.0 (900 GB/s bidirectional) for multi-GPU
Driver	NVIDIA driver >= 535
Hopper Features	TMA, FP8 native, 4th-gen Tensor Cores, Thread Block Clusters
Python Package	torch >= 2.9.1 with CUDA 12 support
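The table above can be turned into a simple preflight check. The sketch below is an illustration under stated assumptions. `parse_version` and `check_requirements` are hypothetical helpers, and the caller is assumed to supply the detected values, for example from `torch.cuda.get_device_capability()`, `torch.version.cuda`, `torch.__version__`, and the driver version reported by `nvidia-smi`. Pre-release version strings such as `2.9.0a0` are not handled.

```python
def parse_version(text):
    """Turn '12.4' or '535.104.05' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def check_requirements(capability, cuda_version, driver_version, torch_version):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tuple(capability) < (9, 0):
        problems.append(f"compute capability {capability} is below SM 9.0")
    if parse_version(cuda_version)[0] != 12:
        problems.append(f"CUDA {cuda_version} is not a 12.x toolkit")
    if parse_version(driver_version) < (535,):
        problems.append(f"driver {driver_version} is older than 535")
    # Drop a local build suffix such as '+cu128' before comparing.
    if parse_version(torch_version.split("+")[0]) < (2, 9, 1):
        problems.append(f"torch {torch_version} is older than 2.9.1")
    return problems


# Example values only; replace them with what the target machine reports.
issues = check_requirements((9, 0), "12.4", "535.104.05", "2.9.1+cu128")
print("OK" if not issues else "\n".join(issues))
```

Returning a list of problems rather than raising on the first failure lets a deployment script report every mismatch in one pass.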

Semantic Links

Implementation:Vllm_project_Vllm_Benchmark_Batch_Invariance
