Environment:Sgl_project_Sglang_CUDA_GPU_Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Linux environment with NVIDIA CUDA GPU (compute capability >= 7.5 / SM75), CUDA toolkit 12.3+ (12.8+ for Blackwell), and Python 3.10+ for serving LLMs and VLMs with SGLang.
Description
SGLang requires an NVIDIA GPU with compute capability SM75 or higher to function. The runtime uses PyTorch's CUDA backend for tensor operations and custom CUDA kernels (via sgl-kernel and FlashInfer) for high-performance attention, quantization, and MoE layers. Different GPU generations unlock different feature tiers: Ampere (SM80) enables bfloat16, Hopper (SM90) enables Flash Attention 3 and TMA-based kernels, and Blackwell (SM100/SM120) enables Flash Attention 4 and TensorRT-LLM MLA/MHA backends. CUDA 12.3 is the minimum for Hopper features; CUDA 12.8 is required for Blackwell.
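The generation-to-feature mapping above can be sketched as a pure function. This is a hypothetical helper for illustration only (the name `feature_tier` and its return strings are not part of SGLang):

```python
def feature_tier(sm_major: int, sm_minor: int, cuda: tuple) -> str:
    """Map (compute capability, CUDA version) to the feature tier described
    above. Illustrative sketch only, not SGLang's actual dispatch."""
    sm = sm_major * 10 + sm_minor
    if sm < 75:
        raise ValueError("SGLang only supports sm75 and above.")
    if sm_major in (10, 12) and cuda >= (12, 8):
        return "blackwell: fa4, TensorRT-LLM MLA/MHA"
    if sm_major == 9 and cuda >= (12, 3):
        return "hopper: fa3, TMA-based kernels"
    if sm_major == 8:
        return "ampere: bfloat16"
    return "turing: float16 only"

# Example: an H100 (SM90) with CUDA 12.4 lands in the Hopper tier
print(feature_tier(9, 0, (12, 4)))
```

Note that the tier check combines both the compute capability and the CUDA toolkit version, mirroring the paired requirements in the Description (e.g. Blackwell hardware without CUDA 12.8 does not unlock the Blackwell tier).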
Usage
Use this environment for all GPU-accelerated SGLang workflows: offline batch inference, online serving, structured output generation, multimodal VLM inference, model quantization, and the frontend DSL. The CUDA GPU runtime is the primary deployment target and is required by all Implementation pages that perform model inference or generation.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows not officially supported; use WSL2 |
| Hardware | NVIDIA GPU SM75+ | Minimum Turing (T4/RTX 2080); Ampere/Hopper/Blackwell recommended |
| VRAM | 8GB minimum | 16GB+ recommended; 40-80GB for large models (7B+) |
| CUDA Toolkit | 12.3+ (Hopper), 12.8+ (Blackwell) | Bundled via cuda-python==12.9 |
| Disk | 50GB+ SSD | Model weights + KV cache can be large |
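The VRAM row follows from a simple rule of thumb: weight memory is roughly parameters times bytes per parameter, with KV cache and activations on top. A minimal sketch of that arithmetic (an illustrative helper, not an SGLang API):

```python
def weights_gb(n_params_billion: float, dtype_bytes: int = 2) -> float:
    """Rough VRAM needed for model weights alone, in GB.
    float16/bfloat16 use 2 bytes per parameter; KV cache and activations
    come on top. Illustrative rule of thumb only."""
    # 1e9 params * dtype_bytes bytes / 1e9 bytes-per-GB = n_params_billion * dtype_bytes
    return n_params_billion * dtype_bytes

print(weights_gb(7))   # ~14 GB: a 7B model fits a 16GB card with little headroom
print(weights_gb(70))  # ~140 GB: needs multi-GPU tensor parallelism
```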
Dependencies
System Packages
- NVIDIA GPU driver (550+ recommended)
- `nvidia-smi` (for GPU memory queries)
- `cuda-toolkit` >= 12.3 (bundled as `cuda-python==12.9`)
Python Packages
- `torch` == 2.9.1
- `sgl-kernel` == 0.3.21
- `flashinfer_python` == 0.6.2
- `flashinfer_cubin` == 0.6.2
- `cuda-python` == 12.9
- `triton` (for Triton attention/MoE kernels)
- `nvidia-cutlass-dsl` >= 4.3.4
Credentials
No credentials required for the CUDA GPU runtime itself. Model downloads may require:
- `HF_TOKEN`: HuggingFace API token for gated models (e.g., Llama)
Quick Install
# Install SGLang with all CUDA dependencies
pip install "sglang[all]>=0.4" --find-links https://flashinfer.ai/whl/cu124/torch2.9/flashinfer-python
# Or install from source
pip install -e "python/.[all]"
Code Evidence
GPU compute capability check from `python/sglang/srt/utils/common.py:224-266`:
def _check_cuda_device_version(
    device_capability_majors: List[int], cuda_version: Tuple[int, int]
):
    if not is_cuda():
        return False
    return (
        torch.cuda.get_device_capability()[0] in device_capability_majors
        and tuple(map(int, torch.version.cuda.split(".")[:2])) >= cuda_version
    )
is_ampere_with_cuda_12_3 = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[8], cuda_version=(12, 3))
)
is_hopper_with_cuda_12_3 = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[9], cuda_version=(12, 3))
)
is_blackwell_supported = lru_cache(maxsize=1)(
    partial(_check_cuda_device_version, device_capability_majors=[10, 12], cuda_version=(12, 8))
)
Minimum SM75 enforcement from `python/sglang/srt/model_executor/model_runner.py:886-894`:
if self.device == "cuda":
    if torch.cuda.get_device_capability()[0] < 8:
        logger.info(
            "Compute capability below sm80. Use float16 due to lack of bfloat16 support."
        )
        self.server_args.dtype = "float16"
        self.model_config.dtype = torch.float16
        if torch.cuda.get_device_capability()[1] < 5:
            raise RuntimeError("SGLang only supports sm75 and above.")
CUDA availability detection from `python/sglang/srt/utils/common.py:131-133`:
@lru_cache(maxsize=1)
def is_cuda():
    return torch.cuda.is_available() and torch.version.cuda
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `SGLang only supports sm75 and above.` | GPU compute capability < 7.5 | Upgrade to Turing (T4/RTX 2080) or newer GPU |
| `Compute capability below sm80. Use float16` | GPU is Turing (SM75) | Informational only: bfloat16 is unavailable, and SGLang falls back to float16 automatically |
| `nvidia-smi not found` | NVIDIA drivers not installed | Install NVIDIA GPU drivers (550+) |
| `Unsupported compute capability. Supported: 9.x, 10.x, 11.x` | Flash Attention 4 requires Hopper+ | Use `--attention-backend triton` for older GPUs |
| `CUDA out of memory` | Insufficient VRAM | Reduce `--mem-fraction-static`, enable quantization, or use tensor parallelism |
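For the out-of-memory row, the `--mem-fraction-static` value can be reasoned about as "usable memory after activation headroom, divided by total memory". A minimal illustrative calculation (this is a rule-of-thumb sketch, not SGLang's actual heuristic; the helper name and 4GB default headroom are assumptions):

```python
def mem_fraction_static(total_gb: float, weights_gb: float,
                        activation_headroom_gb: float = 4.0) -> float:
    """Illustrative: fraction of GPU memory to statically reserve for
    weights + KV cache, leaving headroom for activations. If the weights
    alone exceed the usable budget, no fraction will help."""
    usable = total_gb - activation_headroom_gb
    if usable <= weights_gb:
        raise MemoryError("Weights alone exceed usable VRAM; shard or quantize.")
    return round(usable / total_gb, 2)

# 80GB H100 serving a 7B model in bf16 (~14GB of weights):
print(mem_fraction_static(80, 14))  # 0.95
```

When this calculation raises, the remedies are the ones in the table: quantization to shrink the weights, or tensor parallelism to shard them across GPUs.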
Compatibility Notes
- Turing (SM75): Supported with float16 only. No bfloat16, no Flash Attention 3/4.
- Ampere (SM80): Full bfloat16 support. FlashInfer attention backend available with CUDA 12.3+.
- Hopper (SM90): Flash Attention 3 (fa3), CUTLASS MLA, FlashMLA backends available. Requires CUDA 12.3+.
- Blackwell (SM100/SM120): Flash Attention 4 (fa4), TensorRT-LLM MLA/MHA backends. Requires CUDA 12.8+.
- Multi-GPU: Tensor parallelism (`--tp`) and data parallelism (`--dp`) supported via NCCL.
- FP8 Quantization: Requires SM89+ (Ada Lovelace / Hopper).
Related Pages
- Implementation:Sgl_project_Sglang_ServerArgs_Init
- Implementation:Sgl_project_Sglang_Engine_Init
- Implementation:Sgl_project_Sglang_Engine_Generate
- Implementation:Sgl_project_Sglang_Launch_Server
- Implementation:Sgl_project_Sglang_Init_Distributed_Environment
- Implementation:Sgl_project_Sglang_Get_Model_Loader
- Implementation:Sgl_project_Sglang_Engine_Generate_Multimodal
- Implementation:Sgl_project_Sglang_Multimodal_Data_Loading