
Heuristic: SGLang Attention Backend Selection

From Leeroopedia



Knowledge Sources
Domains Optimization, Attention
Last Updated 2026-02-10 00:00 GMT

Overview

Decision framework for selecting the optimal attention backend (`--attention-backend`) based on GPU architecture, model type (MHA vs MLA), and workload characteristics.

Description

SGLang supports 16+ attention backends, each optimized for different GPU generations and model architectures. The automatic selection logic in `server_args.py` picks defaults based on compute capability and model type, but manual override can yield significant performance gains for specific workloads. Key distinctions: MHA (Multi-Head Attention) models like Llama/Qwen use different optimal backends than MLA (Multi-head Latent Attention) models like DeepSeek-V3. GPU generation determines which backends are available: Hopper unlocks fa3/FlashMLA/CUTLASS MLA, Blackwell unlocks fa4/TensorRT-LLM backends.

Usage

Use this heuristic when deploying a model on a specific GPU type to select the fastest attention backend. Also use when debugging performance issues — switching backends can yield 10-50% throughput improvements for specific model/GPU combinations.
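As a concrete illustration of overriding the flag at launch (the model path below is a placeholder; `--model-path` and `--attention-backend` are the server flags this page describes):

```python
# Sketch: building a launch command that overrides the attention backend.
# The model path is an illustrative placeholder, not a recommendation.
model = "meta-llama/Llama-3.1-8B-Instruct"

launch_cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", model,
    "--attention-backend", "fa3",  # e.g. force FlashAttention-3 on Hopper
]
print(" ".join(launch_cmd))
```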

The Insight (Rule of Thumb)

For MHA models (Llama, Qwen, Mistral):

  • Blackwell (SM100/120): Use `trtllm_mha` (default) or `fa4`
  • Hopper (SM90): Use `fa3` (default, requires CUDA 12.3+)
  • Ampere/Ada (SM80/89): Use `flashinfer` if available, else `triton`
  • Turing (SM75): Use `triton` (only option)

For MLA models (DeepSeek-V3/R1):

  • Blackwell (SM100/120): Use `trtllm_mla` (default)
  • Hopper (SM90): Use `fa3` (default)
  • Ampere/Ada: Use `triton`

For other platforms:

  • AMD ROCm: Use `aiter` or `wave`
  • Intel XPU: Use `intel_xpu`
  • Ascend NPU: Use `ascend`
  • CPU: Use `intel_amx`
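The NVIDIA side of the rule of thumb above can be sketched as a small lookup. This is an illustrative helper, not SGLang's actual selection code (which lives in `server_args.py`); the function name and structure are assumptions:

```python
# Hedged sketch of the NVIDIA rule of thumb: map (compute capability, model
# type) to a suggested backend. Illustrative only, not SGLang's real logic.
def pick_backend(sm: int, mla: bool) -> str:
    if sm >= 100:      # Blackwell (SM100/120)
        return "trtllm_mla" if mla else "trtllm_mha"
    if sm >= 90:       # Hopper (SM90); fa3 requires CUDA 12.3+
        return "fa3"
    if sm >= 80:       # Ampere/Ada (SM80/89); MHA falls back to triton
        return "triton" if mla else "flashinfer"   # if flashinfer is installed
    return "triton"    # Turing (SM75) and older: universal fallback

print(pick_backend(90, mla=False))   # fa3
print(pick_backend(120, mla=True))   # trtllm_mla
```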

Page size considerations:

  • For maximum prefix cache reuse: `--page-size 1` (token-level matching)
  • FlashMLA requires page_size=64, CUTLASS MLA requires 128, TensorRT-LLM MLA requires 32 or 64
  • FlashInfer MLA supports page_size=1
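The hard page-size constraints above can be expressed as a compatibility check. The allowed values come from this page; the dict and function are an illustrative sketch, not SGLang's validation code:

```python
# Sketch of the page-size constraints listed above (values from this page).
# Backends not listed here are treated as unconstrained by this check.
ALLOWED_PAGE_SIZES = {
    "flashmla": {64},
    "cutlass_mla": {128},
    "trtllm_mla": {32, 64},
}

def page_size_ok(backend: str, page_size: int) -> bool:
    allowed = ALLOWED_PAGE_SIZES.get(backend)
    return allowed is None or page_size in allowed

print(page_size_ok("flashmla", 64))     # True
print(page_size_ok("trtllm_mla", 16))   # False
```

FlashInfer MLA is left out of the table because it supports `page_size=1` (token-level matching) rather than imposing a larger fixed size.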

Reasoning

The available backends from `python/sglang/srt/server_args.py:122-144`:

ATTENTION_BACKEND_CHOICES = [
    # Common
    "triton", "torch_native", "flex_attention", "nsa",
    # NVIDIA specific
    "cutlass_mla", "fa3", "fa4", "flashinfer", "flashmla",
    "trtllm_mla", "trtllm_mha", "dual_chunk_flash_attn",
    # AMD specific
    "aiter", "wave",
    # Other platforms
    "intel_amx", "ascend", "intel_xpu",
]

The automatic selection is hardware-driven because each backend uses different GPU instructions:

  • fa3 uses Hopper's TMA (Tensor Memory Accelerator) instructions for SM90
  • fa4 uses Blackwell's SM100/SM120 CUTLASS DSL kernels
  • flashinfer uses FlashInfer's highly optimized CUDA kernels for SM80+
  • triton is the universal fallback using Triton JIT compilation
  • trtllm_mla/mha use TensorRT-LLM's pre-compiled kernels for maximum Blackwell performance

For NSA (Native Sparse Attention) on DeepSeek-V3.2, the kernel is chosen by a threshold on the ratio of KV tokens to query tokens:

# Simplified from nsa_backend.py
if total_kv_tokens < total_q_tokens * 512:
    backend = "flashmla_sparse"  # more efficient for sparse patterns
else:
    backend = "flashmla_kv"      # better for dense KV patterns
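Plugging numbers into that threshold (a worked example, not SGLang code; the function wrapper is illustrative):

```python
# Illustrative check of the NSA sparse/dense threshold quoted above.
def nsa_kernel(total_kv_tokens: int, total_q_tokens: int) -> str:
    if total_kv_tokens < total_q_tokens * 512:
        return "flashmla_sparse"  # few KV tokens per query token
    return "flashmla_kv"          # long KV cache relative to queries

# 8 query tokens over a 100k-token cache: 100_000 < 8 * 512 is false.
print(nsa_kernel(100_000, 8))   # flashmla_kv
print(nsa_kernel(1_000, 8))     # flashmla_sparse
```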
