Environment:Sgl_project_Sglang_CUDA_GPU_Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Linux environment with NVIDIA CUDA GPU (compute capability >= 7.5 / SM75), CUDA toolkit 12.3+ (12.8+ for Blackwell), and Python 3.10+ for serving LLMs and VLMs with SGLang.
Description
SGLang requires an NVIDIA GPU with compute capability SM75 or higher to function. The runtime uses PyTorch's CUDA backend for tensor operations and custom CUDA kernels (via sgl-kernel and FlashInfer) for high-performance attention, quantization, and MoE layers. Different GPU generations unlock different feature tiers: Ampere (SM80) enables bfloat16, Hopper (SM90) enables Flash Attention 3 and TMA-based kernels, and Blackwell (SM100/SM120) enables Flash Attention 4 and TensorRT-LLM MLA/MHA backends. CUDA 12.3 is the minimum for Hopper features; CUDA 12.8 is required for Blackwell.
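The generation-to-feature mapping above can be sketched as a pure function. This is a hypothetical helper for illustration only (the name `feature_tier` and its return strings are not part of SGLang):

```python
def feature_tier(sm_major: int, sm_minor: int, cuda: tuple) -> str:
    """Map (compute capability, CUDA version) to the feature tier described
    above. Illustrative sketch only, not SGLang's actual dispatch."""
    sm = sm_major * 10 + sm_minor
    if sm < 75:
        raise ValueError("SGLang only supports sm75 and above.")
    if sm_major in (10, 12) and cuda >= (12, 8):
        return "blackwell: fa4, TensorRT-LLM MLA/MHA"
    if sm_major == 9 and cuda >= (12, 3):
        return "hopper: fa3, TMA-based kernels"
    if sm_major == 8:
        return "ampere: bfloat16"
    return "turing: float16 only"

# Example: an H100 (SM90) with CUDA 12.4 lands in the Hopper tier
print(feature_tier(9, 0, (12, 4)))
```

Note that the tier check combines both the compute capability and the CUDA toolkit version, mirroring the paired requirements in the Description (e.g. Blackwell hardware without CUDA 12.8 does not unlock the Blackwell tier).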
Usage
Use this environment for all GPU-accelerated SGLang workflows: offline batch inference, online serving, structured output generation, multimodal VLM inference, model quantization, and the frontend DSL. The CUDA GPU runtime is the primary deployment target and is required by all Implementation pages that perform model inference or generation.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows not officially supported; use WSL2 |
| Hardware | NVIDIA GPU SM75+ | Minimum Turing (T4/RTX 2080); Ampere/Hopper/Blackwell recommended |
| VRAM | 8GB minimum | 16GB+ recommended; 40-80GB for large models (7B+) |
| CUDA Toolkit | 12.3+ (Hopper), 12.8+ (Blackwell) | Bundled via cuda-python==12.9 |
| Disk | 50GB+ SSD | Model weights + KV cache can be large |
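The VRAM row follows from a simple rule of thumb: weight memory is roughly parameters times bytes per parameter, with KV cache and activations on top. A minimal sketch of that arithmetic (an illustrative helper, not an SGLang API):

```python
def weights_gb(n_params_billion: float, dtype_bytes: int = 2) -> float:
    """Rough VRAM needed for model weights alone, in GB.
    float16/bfloat16 use 2 bytes per parameter; KV cache and activations
    come on top. Illustrative rule of thumb only."""
    # 1e9 params * dtype_bytes bytes / 1e9 bytes-per-GB = n_params_billion * dtype_bytes
    return n_params_billion * dtype_bytes

print(weights_gb(7))   # ~14 GB: a 7B model fits a 16GB card with little headroom
print(weights_gb(70))  # ~140 GB: needs multi-GPU tensor parallelism
```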
Dependencies
System Packages
- NVIDIA GPU driver (550+ recommended)
- `nvidia-smi` (for GPU memory queries)
- `cuda-toolkit` >= 12.3 (bundled as `cuda-python==12.9`)
Python Packages
- `torch` == 2.9.1
- `sgl-kernel` == 0.3.21
- `flashinfer_python` == 0.6.2
- `flashinfer_cubin` == 0.6.2
- `cuda-python` == 12.9
- `triton` (for Triton attention/MoE kernels)
- `nvidia-cutlass-dsl` >= 4.3.4
Credentials
No credentials required for the CUDA GPU runtime itself. Model downloads may require:
- `HF_TOKEN`: HuggingFace API token for gated models (e.g., Llama)
Quick Install
# Install SGLang with all CUDA dependencies
pip install "sglang[all]>=0.4" --find-links https://flashinfer.ai/whl/cu124/torch2.9/flashinfer-python
# Or install from source
pip install -e "python/.[all]"
Code Evidence
GPU compute capability check from `python/sglang/srt/utils/common.py:224-266`:
def _check_cuda_device_version(
    device_capability_majors: List[int], cuda_version: Tuple[int, int]
):
    if not is_cuda():
        return False
    return (
        torch.cuda.get_device_capability()[0] in device_capability_majors
        and tuple(map(int, torch.version.cuda.split(".")[:2])) >= cuda_version
    )
is_ampere_with_cuda_12_3 = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[8], cuda_version=(12, 3))
)
is_hopper_with_cuda_12_3 = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[9], cuda_version=(12, 3))
)
is_blackwell_supported = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[10, 12], cuda_version=(12, 8))
)
Minimum SM75 enforcement from `python/sglang/srt/model_executor/model_runner.py:886-894`:
if self.device == "cuda":
    if torch.cuda.get_device_capability()[0] < 8:
        logger.info(
            "Compute capability below sm80. Use float16 due to lack of bfloat16 support."
        )
        self.server_args.dtype = "float16"
        self.model_config.dtype = torch.float16
        if torch.cuda.get_device_capability()[1] < 5:
            raise RuntimeError("SGLang only supports sm75 and above.")
CUDA availability detection from `python/sglang/srt/utils/common.py:131-133`:
@lru_cache(maxsize=1)
def is_cuda():
    return torch.cuda.is_available() and torch.version.cuda
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `SGLang only supports sm75 and above.` | GPU compute capability < 7.5 | Upgrade to Turing (T4/RTX 2080) or newer GPU |
| `Compute capability below sm80. Use float16` | GPU is Turing (SM75) | Informational only: bfloat16 is unavailable, and SGLang falls back to float16 automatically |
| `nvidia-smi not found` | NVIDIA drivers not installed | Install NVIDIA GPU drivers (550+) |
| `Unsupported compute capability. Supported: 9.x, 10.x, 11.x` | Flash Attention 4 requires Hopper+ | Use `--attention-backend triton` for older GPUs |
| `CUDA out of memory` | Insufficient VRAM | Reduce `--mem-fraction-static`, enable quantization, or use tensor parallelism |
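For the out-of-memory row, the `--mem-fraction-static` value can be reasoned about as "usable memory after activation headroom, divided by total memory". A minimal illustrative calculation (this is a rule-of-thumb sketch, not SGLang's actual heuristic; the helper name and 4GB default headroom are assumptions):

```python
def mem_fraction_static(total_gb: float, weights_gb: float,
                        activation_headroom_gb: float = 4.0) -> float:
    """Illustrative: fraction of GPU memory to statically reserve for
    weights + KV cache, leaving headroom for activations. If the weights
    alone exceed the usable budget, no fraction will help."""
    usable = total_gb - activation_headroom_gb
    if usable <= weights_gb:
        raise MemoryError("Weights alone exceed usable VRAM; shard or quantize.")
    return round(usable / total_gb, 2)

# 80GB H100 serving a 7B model in bf16 (~14GB of weights):
print(mem_fraction_static(80, 14))  # 0.95
```

When this calculation raises, the remedies are the ones in the table: quantization to shrink the weights, or tensor parallelism to shard them across GPUs.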
Compatibility Notes
- Turing (SM75): Supported with float16 only. No bfloat16, no Flash Attention 3/4.
- Ampere (SM80): Full bfloat16 support. FlashInfer attention backend available with CUDA 12.3+.
- Hopper (SM90): Flash Attention 3 (fa3), CUTLASS MLA, FlashMLA backends available. Requires CUDA 12.3+.
- Blackwell (SM100/SM120): Flash Attention 4 (fa4), TensorRT-LLM MLA/MHA backends. Requires CUDA 12.8+.
- Multi-GPU: Tensor parallelism (`--tp`) and data parallelism (`--dp`) supported via NCCL.
- FP8 Quantization: Requires SM89+ (Ada Lovelace / Hopper).
Related Pages
- Implementation:Sgl_project_Sglang_ServerArgs_Init
- Implementation:Sgl_project_Sglang_Engine_Init
- Implementation:Sgl_project_Sglang_Engine_Generate
- Implementation:Sgl_project_Sglang_Launch_Server
- Implementation:Sgl_project_Sglang_Init_Distributed_Environment
- Implementation:Sgl_project_Sglang_Get_Model_Loader
- Implementation:Sgl_project_Sglang_Engine_Generate_Multimodal
- Implementation:Sgl_project_Sglang_Multimodal_Data_Loading