
Environment: turboderp-org/exllamav2 Flash Attention Backend

From Leeroopedia
Domains: Infrastructure, Inference_Optimization, GPU_Computing
Last Updated: 2026-02-15 00:00 GMT

Overview

Optional Flash Attention 2.2.1+ dependency for high-performance attention computation, with version 2.5.7+ required for paged attention in the dynamic generator.

Description

Flash Attention is an optional but strongly recommended dependency that provides memory-efficient attention kernels. ExLlamaV2 detects Flash Attention at import time and enables progressively more features based on the installed version:

  • flash_attn >= 2.2.1: Basic flash attention via `flash_attn_func` and `flash_attn_varlen_func`
  • flash_attn >= 2.5.7: Paged attention via `flash_attn_with_kvcache` and the low-level `flash_attn_2_cuda` interface
  • Sliding window support: Detected by inspecting `flash_attn_func` signature for `window_size` parameter
  • Softcap support: Detected by inspecting `flash_attn_func` signature for `softcap` parameter
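
The signature-inspection approach described above can be sketched with `inspect.signature`. The `flash_attn_func` below is a hypothetical stand-in with illustrative parameters, not the real kernel:

```python
import inspect

# Hypothetical stand-in for flash_attn.flash_attn_func, used only to
# illustrate how optional-feature detection by signature works.
def flash_attn_func(q, k, v, dropout_p=0.0, causal=False,
                    window_size=(-1, -1), softcap=0.0):
    pass

params = inspect.signature(flash_attn_func).parameters
has_flash_attn_with_window = "window_size" in params
has_flash_attn_with_softcap = "softcap" in params

print(has_flash_attn_with_window, has_flash_attn_with_softcap)  # True True
```

If the installed flash-attn predates a feature, its `flash_attn_func` simply lacks that keyword, so the corresponding flag stays False.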

Without Flash Attention, ExLlamaV2 falls back to xformers, Torch SDPA, or manual matmul attention. Flash Attention requires Ampere or newer GPUs (compute capability >= 8.0).

Usage

Install Flash Attention for all GPU-accelerated inference workflows. It is mandatory for using ExLlamaV2DynamicGenerator in paged mode (the recommended generator). Without it, the dynamic generator falls back to unpaged mode with batch_size=1 and reduced functionality.
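
The fallback behaviour described above can be expressed as a small decision helper. This is a sketch based on this page's description, not ExLlamaV2's actual API; the returned keys are illustrative:

```python
def dynamic_generator_mode(has_flash_attn_with_paged: bool) -> dict:
    """Paged mode needs flash-attn >= 2.5.7; otherwise the dynamic
    generator falls back to unpaged mode with batch size 1."""
    if has_flash_attn_with_paged:
        return {"paged": True}
    return {"paged": False, "max_batch_size": 1}

print(dynamic_generator_mode(False))  # {'paged': False, 'max_batch_size': 1}
```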

System Requirements

Category | Requirement | Notes
Hardware | NVIDIA GPU with compute capability >= 8.0 (Ampere+) | RTX 3000 series, A100, A6000, RTX 4000 series, H100, etc.
CUDA | CUDA toolkit matching the PyTorch build | Required for flash-attn compilation
OS | Linux or Windows | Pre-built wheels are primarily available for Linux

Dependencies

Python Packages

  • `flash-attn` >= 2.2.1 (basic), >= 2.5.7 (paged attention, recommended)
  • `flash_attn_2_cuda` (bundled with flash-attn >= 2.5.7, provides low-level paged API)
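
These version gates can be checked by parsing the version string the same way the detection code does: only purely numeric dot-components are kept, so suffixes like `.post1` are ignored. A minimal sketch:

```python
def parse_version(v: str) -> list[int]:
    # Keep only purely numeric components, mirroring the detection code
    return [int(t) for t in v.split(".") if t.isdigit()]

ver = parse_version("2.5.7.post1")
print(ver)                                   # [2, 5, 7]
print(ver >= [2, 5, 7])                      # True  -> paged attention available
print(parse_version("2.2.1") >= [2, 5, 7])   # False -> basic flash attention only
```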

Alternative Backends

If Flash Attention is not installed, ExLlamaV2 uses these fallbacks in priority order:

  • `xformers` (requires `xformers.ops` with `LowerTriangularFromBottomRightMask` support)
  • Torch SDPA (requires PyTorch 2.4.0+ for `causal_lower_right` bias)
  • Manual matmul attention (always available, highest memory usage)
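
This priority order can be sketched as a simple selector. It is illustrative only; ExLlamaV2 makes this choice internally at import time:

```python
def pick_attention_backend(has_flash_attn: bool,
                           has_xformers: bool,
                           has_sdpa: bool) -> str:
    """Return the first available backend in priority order."""
    if has_flash_attn:
        return "flash_attn"
    if has_xformers:
        return "xformers"
    if has_sdpa:
        return "sdpa"
    return "matmul"  # always available, highest memory usage

print(pick_attention_backend(False, True, True))  # xformers
```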

Credentials

No credentials required.

The following environment variables can override attention backend selection:

  • `EXLLAMA_NO_FLASH_ATTN`: Set to any value to disable Flash Attention import
  • `EXLLAMA_NO_XFORMERS`: Set to any value to disable xformers import
  • `EXLLAMA_NO_SDPA`: Set to any value to disable Torch SDPA
  • `EXLLAMA_NO_GRAPHS`: Set to any value to disable CUDA graphs
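
As the detection code below shows, these variables are checked by membership only, so any value disables the feature:

```python
import os

# Any value works: the source checks for the key's presence, not its value
os.environ["EXLLAMA_NO_FLASH_ATTN"] = "1"

flash_attn_disabled = "EXLLAMA_NO_FLASH_ATTN" in os.environ
print(flash_attn_disabled)  # True
```

Equivalently from the shell: `EXLLAMA_NO_FLASH_ATTN=1 python my_script.py` (script name is hypothetical).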

Quick Install

# Install Flash Attention (requires CUDA and Ampere+ GPU)
pip install flash-attn --no-build-isolation

# For xformers as alternative (works on pre-Ampere GPUs too)
pip install xformers

Code Evidence

Flash Attention version detection from `exllamav2/attn.py:26-59`:

has_flash_attn = False
has_flash_attn_with_paged = False
has_flash_attn_with_window = False
has_flash_attn_with_softcap = False

if 'EXLLAMA_NO_FLASH_ATTN' not in os.environ:
    try:
        import flash_attn
        flash_attn_ver = [int(t) for t in flash_attn.__version__.split(".") if t.isdigit()]

        if [2, 2, 1] <= flash_attn_ver < [2, 5, 7]:
            from flash_attn import flash_attn_func, flash_attn_varlen_func
            has_flash_attn = True

        if [2, 5, 7] <= flash_attn_ver:
            from flash_attn import flash_attn_func, flash_attn_varlen_func, flash_attn_with_kvcache
            import flash_attn_2_cuda as flash_attn_cuda
            has_flash_attn = True
            has_flash_attn_with_paged = True
    except ModuleNotFoundError:
        pass

Ampere GPU requirement check from `exllamav2/attn.py:38-41`:

is_ampere_or_newer_gpu = any(
    torch.cuda.get_device_properties(i).major >= 8
    for i in range(torch.cuda.device_count())
)
if not is_ampere_or_newer_gpu:
    print(" ## Warning: Flash Attention is installed but unsupported GPUs were detected.")

Paged attention assertion from `exllamav2/attn.py:84-91`:

def assert_paged_attn():
    global has_flash_attn_with_paged
    assert has_flash_attn_with_paged, \
        "Paged attention required Flash Attention 2.5.7 or later"

Model compatibility override from `exllamav2/config.py:629-676`:

def arch_compat_overrides(self, quiet: bool = False, warn_only = False):
    if self.attn_logit_softcapping and not has_flash_attn_with_softcap:
        warnings.append(" !! Warning: model requires softcap, not supported in installed version of flash-attn")
    if (self.arch.lm.swa or self.arch.lm.alternating_swa) and not has_flash_attn_with_window:
        warnings.append(" !! Warning: model requires SWA, not supported in installed version of flash-attn")

Common Errors

Error Message | Cause | Solution
`Paged attention required Flash Attention 2.5.7 or later` | flash-attn not installed, or version < 2.5.7 | `pip install "flash-attn>=2.5.7" --no-build-isolation`
`Warning: Flash Attention is installed but unsupported GPUs were detected` | GPU compute capability < 8.0 (pre-Ampere) | Use xformers instead, or upgrade to an Ampere+ GPU
`Warning: model requires softcap, not supported in installed version of flash-attn` | Model needs logit softcapping, but the installed flash-attn lacks the `softcap` parameter | Upgrade flash-attn; otherwise softcapping is disabled
`Warning: model requires SWA, not supported in installed version of flash-attn` | Model uses sliding window attention, but the installed flash-attn lacks the `window_size` parameter | Upgrade flash-attn; otherwise SWA is disabled

Compatibility Notes

  • Pre-Ampere GPUs (SM < 8.0): Flash Attention cannot be used. Install xformers as the attention backend instead.
  • ROCm (AMD): Flash Attention has limited ROCm support. The SDPA backend is noted as "unreliable on ROCm" in the source. xformers may be the best option for AMD GPUs.
  • Sliding Window Attention: Not supported in tensor-parallel mode regardless of Flash Attention version.
  • Without Flash Attention: The dynamic generator operates in unpaged mode with max batch_size=1, no CFG support, and no prefix caching. This significantly limits throughput.
