Environment:Vllm project Vllm CUDA Hopper

From Leeroopedia


Knowledge Sources
Domains GPU_Computing, NVIDIA_Hopper
Last Updated 2026-02-08 00:00 GMT

Overview

NVIDIA Hopper (SM 9.0) GPU architecture environment for vLLM, targeting H100 and H200 accelerators with Hopper-specific features including the Tensor Memory Accelerator (TMA), native FP8 compute, and fourth-generation Tensor Cores.

Description

This environment defines the hardware and software requirements for running vLLM on NVIDIA Hopper-class GPUs (H100, H200). Hopper introduces several architectural features that vLLM uses to raise inference throughput: the Tensor Memory Accelerator (TMA) for asynchronous bulk data movement between global and shared memory, native FP8 (E4M3/E5M2) arithmetic on Tensor Cores, and the Transformer Engine for mixed-precision attention and MLP computation. vLLM's FlashAttention backend reaches its peak performance on Hopper through warp-specialized kernels that overlap TMA loads with MMA (Matrix Multiply-Accumulate) instructions. FP8 quantized inference on Hopper provides up to 2x the throughput of FP16/BF16 with minimal accuracy loss. The H200 variant raises memory capacity to 141 GB of HBM3e (versus 80 GB on the H100), which leaves room for larger KV caches and longer context lengths without resorting to tensor parallelism.
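The effect of the extra H200 memory on KV cache capacity can be estimated with simple arithmetic. The following is a rough sketch, not vLLM's own memory planner: the function names, the model shape (80 layers, 8 KV heads, head dimension 128), and the 70 GB weight footprint are hypothetical values chosen for illustration, and the estimate ignores activations, CUDA graphs, and allocator overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache needed for one token across all layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_mem_gb, weights_gb, bytes_per_token):
    """Upper bound on cacheable tokens after subtracting model weights.

    Uses 1 GB = 10**9 bytes and ignores every other memory consumer,
    so real capacity will be lower.
    """
    free_bytes = (gpu_mem_gb - weights_gb) * 10**9
    return max(0, int(free_bytes // bytes_per_token))


if __name__ == "__main__":
    # Hypothetical model shape and weight size, for illustration only.
    per_token = kv_bytes_per_token(
        num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2
    )
    for name, mem_gb in (("H100", 80), ("H200", 141)):
        tokens = max_cached_tokens(mem_gb, weights_gb=70, bytes_per_token=per_token)
        print(f"{name}: roughly {tokens} tokens of KV cache at most")
```

Under these assumptions the capacity gap comes entirely from the memory left over after weights, which is why a single H200 can hold contexts that would otherwise require splitting the model across several H100s.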

Usage

To target Hopper GPUs, install vLLM with CUDA 12.x support. vLLM automatically detects Hopper hardware via compute capability checks (SM >= 9.0) and enables Hopper-optimized code paths. For FP8 inference, pass the --quantization fp8 flag or load a pre-quantized FP8 model. Setting the environment variable VLLM_USE_DEEP_GEMM=1 enables DeepGEMM kernels optimized for Hopper's Tensor Cores. Batch invariance benchmarks check that model outputs on Hopper hardware stay consistent regardless of batch size.
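The steps above can be sketched in Python. This is an illustration under stated assumptions, not vLLM's internal detection logic: the helper names are hypothetical, the capability is read through PyTorch's torch.cuda.get_device_capability, and "my-org/my-model" is a placeholder model name. The flag and environment variable are the ones named in this section.

```python
import os

HOPPER_MIN_CAPABILITY = (9, 0)  # SM 9.0, the threshold stated in this section


def supports_hopper_paths(capability):
    """Return True when a (major, minor) compute capability is SM 9.0 or newer."""
    return tuple(capability) >= HOPPER_MIN_CAPABILITY


def detect_capability():
    """Query device 0 through PyTorch; return None if no CUDA device is usable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


def build_fp8_launch(model, use_deep_gemm=True):
    """Build the argv and environment for an FP8 `vllm serve` launch."""
    env = dict(os.environ)
    if use_deep_gemm:
        env["VLLM_USE_DEEP_GEMM"] = "1"
    argv = ["vllm", "serve", model, "--quantization", "fp8"]
    return argv, env


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible; skipping the capability check.")
    else:
        print(f"SM {cap[0]}.{cap[1]}; Hopper paths expected: {supports_hopper_paths(cap)}")
    # Placeholder model name; substitute a real checkpoint.
    argv, env = build_fp8_launch("my-org/my-model")
    print("VLLM_USE_DEEP_GEMM=" + env["VLLM_USE_DEEP_GEMM"], " ".join(argv))
```

The returned argv and env can be handed to subprocess.run(argv, env=env) to start the server. Building them separately keeps the sketch free of side effects.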

Requirements

Requirement          Value
GPU                  NVIDIA H100, H200, or compatible Hopper-class GPU
Compute Capability   SM 9.0+
CUDA Toolkit         12.x (12.4+ recommended)
GPU Memory           80 GB HBM3 (H100) or 141 GB HBM3e (H200)
NVLink               NVLink 4.0 (900 GB/s bidirectional) for multi-GPU
Driver               NVIDIA driver >= 535
Hopper Features      TMA, FP8 native, 4th-gen Tensor Cores, Thread Block Clusters
Python Package       torch >= 2.9.1 with CUDA 12 support
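A few of the rows above lend themselves to an automated preflight check. The sketch below is a hypothetical helper, not part of vLLM: it takes the compute capability, CUDA version, and driver version as plain values (gathered however is convenient, for example from nvidia-smi output) and compares them with the thresholds in the table.

```python
def parse_version(text):
    """Turn '12.4' or '535.104.05' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def check_hopper_requirements(capability, cuda_version, driver_version):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(capability) < (9, 0):
        problems.append(
            f"compute capability {capability[0]}.{capability[1]} is below SM 9.0"
        )
    if parse_version(cuda_version)[0] != 12:
        problems.append(f"CUDA {cuda_version} is not a 12.x toolkit")
    if parse_version(driver_version)[0] < 535:
        problems.append(f"driver {driver_version} is older than 535")
    return problems


if __name__ == "__main__":
    # Example values only; replace with what the target machine reports.
    for problem in check_hopper_requirements((8, 0), "11.8", "525.60.13"):
        print("requirement not met:", problem)
```

Returning a list rather than raising on the first failure lets a deployment script report every unmet requirement in one pass.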
