# Heuristic: vLLM Attention Backend Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, GPU_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
## Overview
vLLM automatically selects the optimal attention backend based on GPU architecture (SM capability) and whether the model uses Multi-Head Latent Attention (MLA). The selection is a ranked priority list, where the first available backend is chosen. Blackwell GPUs (SM 10.0) prefer FlashInfer, while older architectures (Hopper/Ampere) prefer Flash Attention.
## Description
The CUDA platform in vLLM determines the attention backend through a priority-ordered list that varies across two dimensions: GPU generation (Blackwell SM 10.0 vs. earlier) and attention type (MLA vs. standard). Each combination returns a distinct ranked list of backend candidates. The runtime attempts each backend in order and uses the first one whose dependencies are satisfied. Additionally, MLA backends impose specific KV cache block size constraints, and cuDNN SDPA is globally disabled due to known crash issues in PyTorch 2.5+.
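That first-match scan over a ranked list can be sketched as follows. The `select_backend` function and the `available` set are illustrative stand-ins for vLLM's actual availability checks, not its real API; the backend names match the enum values quoted in the code evidence on this page.

```python
# Illustrative sketch of vLLM's first-match backend selection; the real
# logic lives in vllm/platforms/cuda.py. select_backend() and the
# `available` set are assumptions made for this example.

def select_backend(priority_list, available):
    """Return the first backend in the ranked list whose dependencies
    are satisfied, mirroring the first-match strategy."""
    for backend in priority_list:
        if backend in available:
            return backend
    raise RuntimeError("no supported attention backend is available")

# Ranked list for Blackwell (SM 10.0) with an MLA model, per the heuristic.
BLACKWELL_MLA = [
    "FLASHINFER_MLA",
    "CUTLASS_MLA",
    "FLASH_ATTN_MLA",
    "FLASHMLA",
    "TRITON_MLA",
]

# If FlashInfer MLA is missing, the scan falls through to CUTLASS MLA.
chosen = select_backend(BLACKWELL_MLA, {"CUTLASS_MLA", "TRITON_MLA"})
print(chosen)  # -> CUTLASS_MLA
```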
## Usage
Apply this heuristic when:
- Deploying vLLM on different GPU architectures and wanting to understand which attention backend will be auto-selected.
- Debugging performance regressions after moving between GPU generations (e.g., from Hopper to Blackwell).
- Deciding whether to override the automatic selection with `--attention-backend`.
- Troubleshooting KV cache block size errors with MLA-based models (e.g., DeepSeek-V2/V3).
## The Insight (Rule of Thumb)
- Blackwell GPUs (SM 10.0) + MLA models: FlashInfer MLA is preferred first, followed by CUTLASS MLA, Flash Attention MLA, FlashMLA, then Triton MLA.
- Older GPUs (SM < 10.0) + MLA models: Flash Attention MLA is preferred first, followed by FlashMLA, FlashInfer MLA, then Triton MLA.
- Blackwell GPUs (SM 10.0) + standard attention: FlashInfer is preferred first, followed by Flash Attention, Triton Attention, then Flex Attention.
- Older GPUs (SM < 10.0) + standard attention: Flash Attention is preferred first, followed by FlashInfer, Triton Attention, then Flex Attention.
- MLA block size constraints: FlashMLA requires `block_size` divisible by 64; CUTLASS MLA requires divisibility by 128; FlashInfer MLA requires 32 or a multiple of 64. The engine auto-forces compliant block sizes.
- cuDNN SDPA is globally disabled: PyTorch 2.5 defaults to cuDNN SDPA, which causes crashes on some models. vLLM calls `torch.backends.cuda.enable_cudnn_sdp(False)` at platform init.
- FlashAttention for ViT: requires SM >= 8.0 (Ampere or newer).
- Manual override: pass `--attention-backend BACKEND_NAME` to force a specific backend if the automatic selection is suboptimal for your workload.
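The block size rules above reduce to simple divisibility checks. The `block_size_ok` helper below is hypothetical, written only to make the constraints concrete; vLLM enforces these rules internally rather than exposing such a function.

```python
# Hypothetical helper encoding the MLA block-size constraints described
# above. Not part of vLLM's API.

def block_size_ok(backend: str, block_size: int) -> bool:
    """Return True if block_size satisfies the backend's alignment rule."""
    if backend == "FLASHMLA":
        return block_size % 64 == 0
    if backend == "CUTLASS_MLA":
        return block_size % 128 == 0
    if backend == "FLASHINFER_MLA":
        return block_size == 32 or block_size % 64 == 0
    return True  # other backends impose no divisibility constraint

assert block_size_ok("FLASHMLA", 64)
assert not block_size_ok("CUTLASS_MLA", 64)  # needs a multiple of 128
assert block_size_ok("FLASHINFER_MLA", 32)   # 32 is special-cased
```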
## Reasoning
The priority ordering reflects empirical performance findings on each GPU generation. On Blackwell (SM 10.0), FlashInfer provides the best kernel implementations for both MLA and standard attention due to optimizations targeting the new architecture. On older GPUs (Hopper SM 9.0, Ampere SM 8.x), Flash Attention has the most mature and tuned kernels, so it takes priority. The MLA variants exist because Multi-Head Latent Attention (used by DeepSeek-V2/V3) requires specialized kernels that handle the latent compression differently from standard multi-head attention.
The block size constraints for MLA backends are dictated by hardware alignment requirements in each kernel implementation. FlashMLA tiles its computation in 64-token blocks, CUTLASS MLA uses 128-token tiles for optimal tensor core utilization, and FlashInfer MLA supports either 32-token blocks or multiples of 64. Mismatched block sizes would cause incorrect memory access patterns or kernel failures.
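As a rough illustration of the engine's "auto-force" behavior, a requested block size can be rounded up to the nearest compliant value. The round-up rule here is an assumption for illustration; vLLM's actual forcing logic (cuda.py:224-250) may instead pick a fixed compliant value.

```python
# Illustrative sketch only: round a requested block size up to the
# backend's required alignment. The round-up policy is an assumption,
# not vLLM's exact behavior.

def force_block_size(backend: str, requested: int) -> int:
    align = {"FLASHMLA": 64, "CUTLASS_MLA": 128, "FLASHINFER_MLA": 64}.get(backend)
    if align is None:
        return requested  # no divisibility constraint for this backend
    if backend == "FLASHINFER_MLA" and requested <= 32:
        return 32  # FlashInfer MLA also accepts exactly 32
    # Ceiling-divide, then multiply back up to the next aligned multiple.
    return -(-requested // align) * align

print(force_block_size("CUTLASS_MLA", 100))  # -> 128
```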
The global disabling of cuDNN SDPA is a defensive measure. PyTorch 2.5 changed the default to enable cuDNN's scaled dot-product attention, but this triggers crashes on certain model architectures. vLLM preemptively disables it during CUDA platform initialization.
## Code Evidence
Backend priority for MLA on Blackwell, from `vllm/platforms/cuda.py:51-57`:

```python
if use_mla:
    if device_capability.major == 10:
        return [
            AttentionBackendEnum.FLASHINFER_MLA,
            AttentionBackendEnum.CUTLASS_MLA,
            AttentionBackendEnum.FLASH_ATTN_MLA,
            AttentionBackendEnum.FLASHMLA,
            AttentionBackendEnum.TRITON_MLA,
        ]
```
Backend priority for MLA on non-Blackwell, from `vllm/platforms/cuda.py:61-67`:

```python
    else:
        return [
            AttentionBackendEnum.FLASH_ATTN_MLA,
            AttentionBackendEnum.FLASHMLA,
            AttentionBackendEnum.FLASHINFER_MLA,
            AttentionBackendEnum.TRITON_MLA,
        ]
```
Backend priority for non-MLA on Blackwell, from `vllm/platforms/cuda.py:69-75`:

```python
else:
    if device_capability.major == 10:
        return [
            AttentionBackendEnum.FLASHINFER,
            AttentionBackendEnum.FLASH_ATTN,
            AttentionBackendEnum.TRITON_ATTN,
            AttentionBackendEnum.FLEX_ATTENTION,
        ]
```
Backend priority for non-MLA on non-Blackwell, from `vllm/platforms/cuda.py:77-82`:

```python
    else:
        return [
            AttentionBackendEnum.FLASH_ATTN,
            AttentionBackendEnum.FLASHINFER,
            AttentionBackendEnum.TRITON_ATTN,
            AttentionBackendEnum.FLEX_ATTENTION,
        ]
```
MLA block size forcing, from `vllm/platforms/cuda.py:224-250` (summarized):

```python
# FlashMLA requires block_size divisible by 64
# CUTLASS_MLA requires block_size divisible by 128
# FlashInferMLA requires block_size 32 or divisible by 64
```
FlashAttention for ViT requires SM >= 8.0, from `vllm/platforms/cuda.py:399`:

```python
if (cc := cls.get_device_capability()) and cc.major >= 8:
    try:
        backend_class = AttentionBackendEnum.FLASH_ATTN.get_class()
```
Global cuDNN SDPA disable, from `vllm/platforms/cuda.py:40-41`:

```python
# pytorch 2.5 uses cudnn sdpa by default, which will cause crash on some models
torch.backends.cuda.enable_cudnn_sdp(False)
```