
Heuristic: Turboderp org ExLlamaV2 Attention Backend Selection

From Leeroopedia
Knowledge Sources
Domains Inference_Optimization, GPU_Computing
Last Updated 2026-02-15 00:00 GMT

Overview

Decision framework for attention backend selection: Flash Attention (Ampere+, best performance) > xformers (pre-Ampere fallback) > Torch SDPA (PyTorch 2.4+, unreliable on ROCm) > manual matmul (universal fallback, highest memory).

Description

ExLlamaV2 supports four attention backends with automatic fallback. The selection is determined at import time based on installed packages, GPU capabilities, and environment variables. Without Flash Attention, the dynamic generator operates in degraded unpaged mode. Understanding this hierarchy helps diagnose performance issues and choose the right backend for your hardware.

Usage

Apply this heuristic when choosing which attention packages to install, debugging slow inference, or troubleshooting attention-related errors. The backend can be overridden via environment variables for testing.
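As a sketch of how such an override interacts with import-time selection, the probe below checks which backends could be chosen on the current machine. Only `EXLLAMA_NO_SDPA` is named on this page; the probe itself is illustrative and does not reproduce the real logic in `exllamav2/attn.py`.

```python
import importlib.util
import os

# Overrides must be set BEFORE importing exllamav2, because backend
# selection happens at import time (EXLLAMA_NO_SDPA is the variable
# documented on this page).
os.environ["EXLLAMA_NO_SDPA"] = "1"

def available_backends():
    """Probe candidate backends in the documented priority order.
    Illustrative only; the real selection lives in exllamav2/attn.py."""
    backends = []
    if importlib.util.find_spec("flash_attn") is not None:
        backends.append("flash_attn")
    if importlib.util.find_spec("xformers") is not None:
        backends.append("xformers")
    if os.environ.get("EXLLAMA_NO_SDPA") != "1":
        backends.append("torch_sdpa")
    backends.append("matmul")  # universal fallback, always last
    return backends
```

Because the fallback is always present, the returned list is never empty; with `EXLLAMA_NO_SDPA=1` set, `torch_sdpa` never appears.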

The Insight (Rule of Thumb)

  • Action: For Ampere+ GPUs (RTX 3000/4000, A100, H100), install `flash-attn >= 2.5.7`.
  • Value: Provides paged attention (required for dynamic generator), sliding window, and softcap support.
  • Trade-off: Flash Attention compilation can be slow. Pre-built wheels are available.
  • Action: For pre-Ampere GPUs (RTX 2000, V100, T4), install `xformers` instead.
  • Value: Best performance available on SM < 8.0 hardware with memory-efficient attention.
  • Trade-off: No paged attention support. Dynamic generator falls back to unpaged mode (batch_size=1 max).
  • Action: Without flash-attn or xformers, expect attention to be capped by `max_attention_size` (default 2048^2 score-matrix entries).
  • Value: The quadratic formula `cs = (sqrt(past_len^2 + 4*max_a) - past_len) / 2` bounds the chunk size so the score matrix never exceeds `max_a` entries, preventing OOM.
  • Trade-off: Long sequences are processed in smaller chunks, reducing throughput.
  • Action: Set `EXLLAMA_NO_SDPA=1` if using ROCm (AMD GPUs).
  • Value: SDPA is noted as "unreliable on ROCm" in the source code.
  • Trade-off: Falls through to manual matmul attention.
  • Action: Avoid tensor-parallel mode with sliding window attention (SWA) models.
  • Value: The combination is unsupported and fails with an assertion error.
  • Trade-off: For SWA models, use single-GPU or multi-GPU auto-split loading instead of TP.
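The chunk-size rule above can be checked numerically. The helper below solves `(past_len + cs) * cs <= max_a` for the largest chunk `cs`, using the same quadratic formula; the constant mirrors the stated default.

```python
import math

MAX_ATTENTION_SIZE = 2048 ** 2  # default max_attention_size noted above

def max_chunk_size(past_len, max_a=MAX_ATTENTION_SIZE):
    """Largest query chunk cs with (past_len + cs) * cs <= max_a,
    i.e. the positive root of cs^2 + past_len * cs - max_a = 0."""
    return math.floor((math.sqrt(past_len ** 2 + 4 * max_a) - past_len) / 2)
```

With an empty cache (`past_len = 0`) the chunk is `sqrt(max_a) = 2048` tokens; as the cached context grows, chunks shrink, which is the throughput cost named in the trade-off.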

Reasoning

Flash Attention uses a tiling algorithm that keeps memory usage constant regardless of sequence length, making it essential for long-context inference. The paged variant (2.5.7+) enables virtual memory-style cache management that is the foundation of ExLlamaV2's concurrent batching system.
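The virtual-memory analogy can be made concrete with a toy page allocator: sequences map to non-contiguous fixed-size cache pages through a block table, so caches grow and shrink without copying. Everything here (the `PagedKVCache` name, the 256-token page size) is illustrative and is not ExLlamaV2's actual API.

```python
import math

PAGE_SIZE = 256  # tokens per page; illustrative, not ExLlamaV2's constant

class PagedKVCache:
    """Toy block-table allocator sketching paged KV-cache bookkeeping."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page indices

    def reserve(self, seq_id, seq_len):
        """Grow a sequence's block table to cover seq_len tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        needed = math.ceil(seq_len / PAGE_SIZE)
        while len(table) < needed:
            table.append(self.free_pages.pop())  # IndexError when pool is empty
        return table

    def release(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
```

Because pages are recycled between sequences, many concurrent generations can share one fixed cache allocation, which is what enables the dynamic generator's concurrent batching.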

xformers provides a similar memory-efficient attention implementation that works on older GPU architectures. It lacks the paged API needed for the dynamic generator's full feature set.

Without either library, attention computation falls back to explicit Q*K^T matmul with O(n^2) memory, requiring chunked processing for long sequences.
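A minimal NumPy rendering of that fallback shows where the O(n^2) cost comes from: the full `(q_len, kv_len)` score matrix is materialised before the softmax. The lower-right causal mask mirrors the alignment implied by `has_lower_right_sdpa`; this is a sketch, not ExLlamaV2's kernel.

```python
import numpy as np

def manual_attention(q, k, v):
    """Explicit Q @ K^T attention, the universal fallback.
    q: (q_len, d); k, v: (kv_len, d) with kv_len >= q_len.
    Materialises a (q_len, kv_len) score matrix -> O(n^2) memory."""
    q_len, d = q.shape
    kv_len = k.shape[0]
    scores = q @ k.T / np.sqrt(d)
    # Lower-right causal mask: query i may attend keys j <= i + (kv_len - q_len)
    offset = kv_len - q_len
    scores = scores + np.triu(np.full((q_len, kv_len), -np.inf), k=offset + 1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The chunked processing described above simply feeds this function smaller slices of `q` so the score matrix stays under `max_attention_size`.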

From `exllamav2/model.py:879-895`:

if (has_flash_attn and not self.config.no_flash_attn) or \
   (has_xformers and not self.config.no_xformers):
    pass  # Can't measure increase in VRAM with longer k_len
else:
    attn_size = (past_len + remaining_q_len) * remaining_q_len
    if attn_size > max_a:
        cs = (math.sqrt(past_len ** 2 + 4 * max_a) - past_len) / 2
        chunk_size = min(chunk_size, math.floor(cs))

Backend priority from `exllamav2/attn.py`:

# Priority order:
# 1. Flash Attention (if has_flash_attn and not config.no_flash_attn)
# 2. xformers (if has_xformers and not config.no_xformers)
# 3. Torch SDPA (if has_lower_right_sdpa and not config.no_sdpa)
# 4. Manual matmul attention (always available)
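The priority comment above can be written out as a straight-line chain. Here `config` is a plain dict standing in for the `no_*` flags named in the comment; the function is a sketch of the ordering, not the actual code in `exllamav2/attn.py`.

```python
def select_backend(has_flash_attn, has_xformers, has_sdpa, config):
    """Pick the first available, non-disabled backend in priority order."""
    if has_flash_attn and not config.get("no_flash_attn", False):
        return "flash_attn"
    if has_xformers and not config.get("no_xformers", False):
        return "xformers"
    if has_sdpa and not config.get("no_sdpa", False):
        return "torch_sdpa"
    return "matmul"  # always available
```

Note the consequence for debugging: disabling a higher-priority backend (e.g. `no_flash_attn`) silently shifts selection down the chain rather than raising an error.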
