Heuristic: SGLang Attention Backend Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Attention |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Decision framework for selecting the optimal attention backend (`--attention-backend`) based on GPU architecture, model type (MHA vs MLA), and workload characteristics.
Description
SGLang supports 16+ attention backends, each optimized for different GPU generations and model architectures. The automatic selection logic in `server_args.py` picks defaults based on compute capability and model type, but manual override can yield significant performance gains for specific workloads. Key distinctions: MHA (Multi-Head Attention) models like Llama/Qwen use different optimal backends than MLA (Multi-head Latent Attention) models like DeepSeek-V3. GPU generation determines which backends are available: Hopper unlocks fa3/FlashMLA/CUTLASS MLA, Blackwell unlocks fa4/TensorRT-LLM backends.
Usage
Use this heuristic when deploying a model on a specific GPU type to select the fastest attention backend. Also use when debugging performance issues — switching backends can yield 10-50% throughput improvements for specific model/GPU combinations.
The Insight (Rule of Thumb)
For MHA models (Llama, Qwen, Mistral):
- Blackwell (SM100/120): Use `trtllm_mha` (default) or `fa4`
- Hopper (SM90): Use `fa3` (default, requires CUDA 12.3+)
- Ampere/Ada (SM80/89): Use `flashinfer` if available, else `triton`
- Turing (SM75): Use `triton` (only option)
For MLA models (DeepSeek-V3/R1):
- Blackwell (SM100/120): Use `trtllm_mla` (default)
- Hopper (SM90): Use `fa3` (default)
- Ampere/Ada: Use `triton`
For AMD ROCm: Use `aiter` or `wave`
For Intel XPU: Use `intel_xpu`
For Ascend NPU: Use `ascend`
For CPU: Use `intel_amx`
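The NVIDIA portion of the rule of thumb can be sketched as a small lookup function. This is an illustrative helper, not SGLang's actual selection code; the function name and signature are hypothetical, and `sm` is the numeric compute capability (e.g. 90 for Hopper).

```python
# Hypothetical sketch of the NVIDIA rule of thumb above.
# Not SGLang's real selection logic (see server_args.py for that).
def pick_attention_backend(sm: int, is_mla: bool) -> str:
    """Map compute capability and model type to a suggested backend."""
    if sm >= 100:  # Blackwell (SM100/120)
        return "trtllm_mla" if is_mla else "trtllm_mha"
    if sm >= 90:   # Hopper (SM90): fa3 is the default for MHA and MLA
        return "fa3"
    if sm >= 80:   # Ampere/Ada (SM80/89); for MHA, prefer flashinfer if installed
        return "triton" if is_mla else "flashinfer"
    return "triton"  # Turing (SM75) and older: universal fallback

print(pick_attention_backend(90, is_mla=False))   # fa3
print(pick_attention_backend(100, is_mla=True))   # trtllm_mla
```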
Page size considerations:
- For maximum prefix cache reuse: `--page-size 1` (token-level matching)
- FlashMLA requires page_size=64, CUTLASS MLA requires 128, TensorRT-LLM MLA requires 32 or 64
- FlashInfer MLA supports page_size=1
Reasoning
The available backends, from `python/sglang/srt/server_args.py:122-144`:

```python
ATTENTION_BACKEND_CHOICES = [
    # Common
    "triton", "torch_native", "flex_attention", "nsa",
    # NVIDIA specific
    "cutlass_mla", "fa3", "fa4", "flashinfer", "flashmla",
    "trtllm_mla", "trtllm_mha", "dual_chunk_flash_attn",
    # AMD specific
    "aiter", "wave",
    # Other platforms
    "intel_amx", "ascend", "intel_xpu",
]
```
The automatic selection is hardware-driven because each backend uses different GPU instructions:
- fa3 uses Hopper's TMA (Tensor Memory Accelerator) instructions for SM90
- fa4 uses Blackwell's SM100/SM120 CUTLASS DSL kernels
- flashinfer uses FlashInfer's highly optimized CUDA kernels for SM80+
- triton is the universal fallback using Triton JIT compilation
- trtllm_mla/mha use TensorRT-LLM's pre-compiled kernels for maximum Blackwell performance
For NSA (Native Sparse Attention) on DeepSeek-V3.2, a sparse-vs-dense threshold selects between the FlashMLA variants:

```python
# Simplified from nsa_backend.py
if total_kv_tokens < total_q_tokens * 512:
    backend = "flashmla_sparse"  # more efficient for sparse patterns
else:
    backend = "flashmla_kv"      # better for dense KV patterns
```
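A worked example of the threshold, wrapped in a hypothetical helper (the 512 multiplier comes from the snippet; the function name is illustrative):

```python
# Illustrative wrapper around the sparse-vs-dense threshold above.
def choose_flashmla_variant(total_kv_tokens: int, total_q_tokens: int) -> str:
    if total_kv_tokens < total_q_tokens * 512:
        return "flashmla_sparse"
    return "flashmla_kv"

# A decode batch of 8 query tokens stays on the sparse path until the
# KV cache reaches 8 * 512 = 4096 tokens.
print(choose_flashmla_variant(2048, 8))   # flashmla_sparse
print(choose_flashmla_variant(8192, 8))   # flashmla_kv
```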