
Environment: SGLang CUDA GPU Runtime

From Leeroopedia


Knowledge Sources
Domains Infrastructure, GPU_Computing
Last Updated 2026-02-10 00:00 GMT

Overview

A Linux environment with an NVIDIA CUDA GPU (compute capability >= 7.5 / SM75), CUDA toolkit 12.3+ (12.8+ for Blackwell), and Python 3.10+, used for serving LLMs and VLMs with SGLang.

Description

SGLang requires an NVIDIA GPU with compute capability SM75 or higher to function. The runtime uses PyTorch's CUDA backend for tensor operations and custom CUDA kernels (via sgl-kernel and FlashInfer) for high-performance attention, quantization, and MoE layers. Different GPU generations unlock different feature tiers: Ampere (SM80) enables bfloat16, Hopper (SM90) enables Flash Attention 3 and TMA-based kernels, and Blackwell (SM100/SM120) enables Flash Attention 4 and TensorRT-LLM MLA/MHA backends. CUDA 12.3 is the minimum for Hopper features; CUDA 12.8 is required for Blackwell.
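The tiering described above can be sketched as a pure function. The capability tuple would come from `torch.cuda.get_device_capability()` in practice; the helper name below is illustrative, not part of SGLang's API:

```python
def cuda_feature_tier(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the feature tier described
    above. Illustrative summary, not SGLang's actual dispatch logic."""
    if (major, minor) < (7, 5):
        # Mirrors SGLang's hard floor: anything below SM75 is rejected.
        raise RuntimeError("SGLang only supports sm75 and above.")
    if major == 7:
        return "Turing: float16 only"
    if major == 8:
        return "Ampere/Ada: bfloat16"
    if major == 9:
        return "Hopper: FA3, TMA-based kernels"
    return "Blackwell: FA4, TensorRT-LLM MLA/MHA"
```

For example, an A100 reports capability (8, 0) and lands in the bfloat16 tier, while a T4 at (7, 5) is limited to float16.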

Usage

Use this environment for all GPU-accelerated SGLang workflows: offline batch inference, online serving, structured output generation, multimodal VLM inference, model quantization, and the frontend DSL. The CUDA GPU runtime is the primary deployment target and is required by all Implementation pages that perform model inference or generation.
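For online serving, a launch command is typically assembled around `python -m sglang.launch_server`. A minimal sketch of such a command builder is shown below; the flag names mirror commonly documented `sglang.launch_server` options, but verify them against your installed version:

```python
def launch_server_cmd(model_path: str, port: int = 30000, tp: int = 1) -> list[str]:
    """Assemble an online-serving command line for SGLang.
    Sketch only: flag spellings should be checked against
    `python -m sglang.launch_server --help` for your version."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
    ]
    if tp > 1:
        # Tensor parallelism across multiple GPUs (see Compatibility Notes).
        cmd += ["--tp", str(tp)]
    return cmd
```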

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows not officially supported; use WSL2 |
| Hardware | NVIDIA GPU, SM75+ | Minimum Turing (T4/RTX 2080); Ampere/Hopper/Blackwell recommended |
| VRAM | 8 GB minimum | 16 GB+ recommended; 40-80 GB for large models (7B+ parameters) |
| CUDA Toolkit | 12.3+ (Hopper), 12.8+ (Blackwell) | Bundled via `cuda-python==12.9` |
| Disk | 50 GB+ SSD | Model weights + KV cache can be large |
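The VRAM row follows from simple arithmetic: model weights alone take roughly parameter count times bytes per parameter, and KV cache plus activations come on top. A rough rule-of-thumb estimator (illustrative, not an SGLang utility):

```python
# Approximate bytes per parameter for common weight dtypes.
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2, "fp8": 1, "int4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str = "bfloat16") -> float:
    """Rough weight-only VRAM estimate in GB (1 GB ~ 1e9 bytes).
    KV cache and activations are extra, hence the table's headroom."""
    return n_params_billion * BYTES_PER_PARAM[dtype]
```

A 7B model in bfloat16 needs about 14 GB for weights alone, which is why 16 GB+ is only a floor and 40-80 GB cards are recommended for larger models.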

Dependencies

System Packages

  • NVIDIA GPU driver (550+ recommended)
  • `nvidia-smi` (for GPU memory queries)
  • `cuda-toolkit` >= 12.3 (bundled as `cuda-python==12.9`)
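The `nvidia-smi` dependency is used for GPU memory queries. A small parser for its machine-readable output (the query flags shown are standard `nvidia-smi` options; the helper itself is illustrative):

```python
def parse_gpu_memory(csv_out: str) -> list[dict]:
    """Parse the output of:
        nvidia-smi --query-gpu=memory.total,memory.free \
                   --format=csv,noheader,nounits
    One line per GPU; values are in MiB."""
    gpus = []
    for line in csv_out.strip().splitlines():
        total, free = (int(field) for field in line.split(","))
        gpus.append({"total_mib": total, "free_mib": free})
    return gpus
```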

Python Packages

  • `torch` == 2.9.1
  • `sgl-kernel` == 0.3.21
  • `flashinfer_python` == 0.6.2
  • `flashinfer_cubin` == 0.6.2
  • `cuda-python` == 12.9
  • `triton` (for Triton attention/MoE kernels)
  • `nvidia-cutlass-dsl` >= 4.3.4
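Since several of these packages are pinned to exact versions, a quick audit against what is actually installed can catch drift early. A sketch using a plain mapping (in practice the installed versions would come from `importlib.metadata.version`):

```python
# Exact pins from the dependency list above.
PINNED = {
    "torch": "2.9.1",
    "sgl-kernel": "0.3.21",
    "flashinfer_python": "0.6.2",
    "cuda-python": "12.9",
}

def mismatched_pins(installed: dict) -> dict:
    """Return {name: (wanted, got)} for every pin that is missing
    or at the wrong version. Illustrative helper, not SGLang code."""
    return {
        name: (want, installed.get(name))
        for name, want in PINNED.items()
        if installed.get(name) != want
    }
```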

Credentials

No credentials required for the CUDA GPU runtime itself. Model downloads may require:

  • `HF_TOKEN`: HuggingFace API token for gated models (e.g., Llama)

Quick Install

# Install SGLang with all CUDA dependencies
pip install "sglang[all]>=0.4" --find-links https://flashinfer.ai/whl/cu124/torch2.9/flashinfer-python

# Or install from source
pip install -e "python/.[all]"

Code Evidence

GPU compute capability check from `python/sglang/srt/utils/common.py:224-266`:

def _check_cuda_device_version(
    device_capability_majors: List[int], cuda_version: Tuple[int, int]
):
    if not is_cuda():
        return False
    return (
        torch.cuda.get_device_capability()[0] in device_capability_majors
        and tuple(map(int, torch.version.cuda.split(".")[:2])) >= cuda_version
    )

is_ampere_with_cuda_12_3 = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[8], cuda_version=(12, 3))
)
is_hopper_with_cuda_12_3 = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[9], cuda_version=(12, 3))
)
is_blackwell_supported = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[10, 12], cuda_version=(12, 8))
)

Minimum SM75 enforcement from `python/sglang/srt/model_executor/model_runner.py:886-894`:

if self.device == "cuda":
    if torch.cuda.get_device_capability()[0] < 8:
        logger.info("Compute capability below sm80. Use float16 due to lack of bfloat16 support.")
        self.server_args.dtype = "float16"
        self.model_config.dtype = torch.float16
        if torch.cuda.get_device_capability()[1] < 5:
            raise RuntimeError("SGLang only supports sm75 and above.")

CUDA availability detection from `python/sglang/srt/utils/common.py:131-133`:

@lru_cache(maxsize=1)
def is_cuda():
    return torch.cuda.is_available() and torch.version.cuda

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `SGLang only supports sm75 and above.` | GPU compute capability < 7.5 | Upgrade to Turing (T4/RTX 2080) or a newer GPU |
| `Compute capability below sm80. Use float16` | GPU is Turing (SM75) | Informational: bfloat16 unavailable; SGLang falls back to float16 automatically |
| `nvidia-smi not found` | NVIDIA drivers not installed | Install NVIDIA GPU drivers (550+ recommended) |
| `Unsupported compute capability. Supported: 9.x, 10.x, 11.x` | Flash Attention 4 requires Hopper or newer | Use `--attention-backend triton` on older GPUs |
| `CUDA out of memory` | Insufficient VRAM | Reduce `--mem-fraction-static`, enable quantization, or use tensor parallelism |
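For the out-of-memory case, the mitigation follows from simple budgeting: `--mem-fraction-static` caps the share of GPU memory SGLang reserves for weights plus the static KV cache pool. A rough back-of-the-envelope calculation (illustrative arithmetic, not SGLang's internal accounting):

```python
def kv_cache_budget_gb(vram_gb: float, mem_fraction_static: float,
                       weights_gb: float) -> float:
    """GB left for the static KV cache pool after model weights,
    under --mem-fraction-static. Negative means the model does not
    fit at that fraction: lower the fraction won't help; instead
    quantize or shard with tensor parallelism."""
    return vram_gb * mem_fraction_static - weights_gb
```

On an 80 GB card at a fraction of 0.9 with 14 GB of weights, roughly 58 GB remains for KV cache; on a 16 GB card the same model leaves almost nothing, which is when quantization or `--tp` becomes necessary.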

Compatibility Notes

  • Turing (SM75): Supported with float16 only. No bfloat16, no Flash Attention 3/4.
  • Ampere (SM80): Full bfloat16 support. FlashInfer attention backend available with CUDA 12.3+.
  • Hopper (SM90): Flash Attention 3 (fa3), CUTLASS MLA, FlashMLA backends available. Requires CUDA 12.3+.
  • Blackwell (SM100/SM120): Flash Attention 4 (fa4), TensorRT-LLM MLA/MHA backends. Requires CUDA 12.8+.
  • Multi-GPU: Tensor parallelism (`--tp`) and data parallelism (`--dp`) supported via NCCL.
  • FP8 Quantization: Requires SM89+ (Ada Lovelace / Hopper).
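The attention-backend availability in the notes above can be summarized as a lookup by SM level and CUDA version. This is an illustrative condensation of the bullet list, not SGLang's actual backend selection code:

```python
def attention_backends(sm: int, cuda: tuple) -> list[str]:
    """Attention backends available per the compatibility notes above.
    `sm` is the compute capability as an integer (75, 80, 90, ...);
    `cuda` is a (major, minor) toolkit version tuple."""
    backends = ["triton"]  # works on all supported GPUs (SM75+)
    if sm >= 80 and cuda >= (12, 3):
        backends.append("flashinfer")
    if sm >= 90:
        backends.append("fa3")  # Flash Attention 3, Hopper+
    if sm >= 100 and cuda >= (12, 8):
        backends += ["fa4", "trtllm_mla"]  # Blackwell tier
    return backends
```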
