
Environment: turboderp-org/exllamav2 Flash Attention Backend

From Leeroopedia
Domains: Infrastructure, Inference_Optimization, GPU_Computing
Last Updated: 2026-02-15 00:00 GMT

Overview

Optional Flash Attention 2.2.1+ dependency for high-performance attention computation, with version 2.5.7+ required for paged attention in the dynamic generator.

Description

Flash Attention is an optional but strongly recommended dependency that provides memory-efficient attention kernels. ExLlamaV2 detects Flash Attention at import time and enables progressively more features based on the installed version:

  • flash_attn >= 2.2.1: Basic flash attention via `flash_attn_func` and `flash_attn_varlen_func`
  • flash_attn >= 2.5.7: Paged attention via `flash_attn_with_kvcache` and the low-level `flash_attn_2_cuda` interface
  • Sliding window support: Detected by inspecting `flash_attn_func` signature for `window_size` parameter
  • Softcap support: Detected by inspecting `flash_attn_func` signature for `softcap` parameter
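
The signature-inspection approach described above can be sketched with `inspect.signature`. The `flash_attn_func` below is a hypothetical stand-in with illustrative parameters, not the real kernel:

```python
import inspect

# Hypothetical stand-in for flash_attn.flash_attn_func, used only to
# illustrate how optional-feature detection by signature works.
def flash_attn_func(q, k, v, dropout_p=0.0, causal=False,
                    window_size=(-1, -1), softcap=0.0):
    pass

params = inspect.signature(flash_attn_func).parameters
has_flash_attn_with_window = "window_size" in params
has_flash_attn_with_softcap = "softcap" in params

print(has_flash_attn_with_window, has_flash_attn_with_softcap)  # True True
```

If the installed flash-attn predates a feature, its `flash_attn_func` simply lacks that keyword, so the corresponding flag stays False.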

Without Flash Attention, ExLlamaV2 falls back to xformers, Torch SDPA, or manual matmul attention. Flash Attention requires Ampere or newer GPUs (compute capability >= 8.0).

Usage

Install Flash Attention for all GPU-accelerated inference workflows. It is mandatory for using ExLlamaV2DynamicGenerator in paged mode (the recommended generator). Without it, the dynamic generator falls back to unpaged mode with batch_size=1 and reduced functionality.
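
The fallback behaviour described above can be expressed as a small decision helper. This is a sketch based on this page's description, not ExLlamaV2's actual API; the returned keys are illustrative:

```python
def dynamic_generator_mode(has_flash_attn_with_paged: bool) -> dict:
    """Paged mode needs flash-attn >= 2.5.7; otherwise the dynamic
    generator falls back to unpaged mode with batch size 1."""
    if has_flash_attn_with_paged:
        return {"paged": True}
    return {"paged": False, "max_batch_size": 1}

print(dynamic_generator_mode(False))  # {'paged': False, 'max_batch_size': 1}
```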

System Requirements

Category | Requirement | Notes
Hardware | NVIDIA GPU with compute capability >= 8.0 (Ampere+) | RTX 3000 series, A100, A6000, RTX 4000 series, H100, etc.
CUDA | CUDA toolkit matching the PyTorch build | Required for flash-attn compilation
OS | Linux or Windows | Pre-built wheels are primarily available for Linux

Dependencies

Python Packages

  • `flash-attn` >= 2.2.1 (basic), >= 2.5.7 (paged attention, recommended)
  • `flash_attn_2_cuda` (bundled with flash-attn >= 2.5.7, provides low-level paged API)
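
These version gates can be checked by parsing the version string the same way the detection code does: only purely numeric dot-components are kept, so suffixes like `.post1` are ignored. A minimal sketch:

```python
def parse_version(v: str) -> list[int]:
    # Keep only purely numeric components, mirroring the detection code
    return [int(t) for t in v.split(".") if t.isdigit()]

ver = parse_version("2.5.7.post1")
print(ver)                                   # [2, 5, 7]
print(ver >= [2, 5, 7])                      # True  -> paged attention available
print(parse_version("2.2.1") >= [2, 5, 7])   # False -> basic flash attention only
```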

Alternative Backends

If Flash Attention is not installed, ExLlamaV2 uses these fallbacks in priority order:

  • `xformers` (requires `xformers.ops` with `LowerTriangularFromBottomRightMask` support)
  • Torch SDPA (requires PyTorch 2.4.0+ for `causal_lower_right` bias)
  • Manual matmul attention (always available, highest memory usage)
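
This priority order can be sketched as a simple selector. It is illustrative only; ExLlamaV2 makes this choice internally at import time:

```python
def pick_attention_backend(has_flash_attn: bool,
                           has_xformers: bool,
                           has_sdpa: bool) -> str:
    """Return the first available backend in priority order."""
    if has_flash_attn:
        return "flash_attn"
    if has_xformers:
        return "xformers"
    if has_sdpa:
        return "sdpa"
    return "matmul"  # always available, highest memory usage

print(pick_attention_backend(False, True, True))  # xformers
```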

Credentials

No credentials required.

The following environment variables can override attention backend selection:

  • `EXLLAMA_NO_FLASH_ATTN`: Set to any value to disable Flash Attention import
  • `EXLLAMA_NO_XFORMERS`: Set to any value to disable xformers import
  • `EXLLAMA_NO_SDPA`: Set to any value to disable Torch SDPA
  • `EXLLAMA_NO_GRAPHS`: Set to any value to disable CUDA graphs
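
As the detection code below shows, these variables are checked by membership only, so any value disables the feature:

```python
import os

# Any value works: the source checks for the key's presence, not its value
os.environ["EXLLAMA_NO_FLASH_ATTN"] = "1"

flash_attn_disabled = "EXLLAMA_NO_FLASH_ATTN" in os.environ
print(flash_attn_disabled)  # True
```

Equivalently from the shell: `EXLLAMA_NO_FLASH_ATTN=1 python my_script.py` (script name is hypothetical).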

Quick Install

# Install Flash Attention (requires CUDA and Ampere+ GPU)
pip install flash-attn --no-build-isolation

# For xformers as alternative (works on pre-Ampere GPUs too)
pip install xformers

Code Evidence

Flash Attention version detection from `exllamav2/attn.py:26-59`:

has_flash_attn = False
has_flash_attn_with_paged = False
has_flash_attn_with_window = False
has_flash_attn_with_softcap = False

if 'EXLLAMA_NO_FLASH_ATTN' not in os.environ:
    try:
        import flash_attn
        flash_attn_ver = [int(t) for t in flash_attn.__version__.split(".") if t.isdigit()]

        if [2, 2, 1] <= flash_attn_ver < [2, 5, 7]:
            from flash_attn import flash_attn_func, flash_attn_varlen_func
            has_flash_attn = True

        if [2, 5, 7] <= flash_attn_ver:
            from flash_attn import flash_attn_func, flash_attn_varlen_func, flash_attn_with_kvcache
            import flash_attn_2_cuda as flash_attn_cuda
            has_flash_attn = True
            has_flash_attn_with_paged = True
    except ModuleNotFoundError:
        pass

Ampere GPU requirement check from `exllamav2/attn.py:38-41`:

is_ampere_or_newer_gpu = any(
    torch.cuda.get_device_properties(i).major >= 8
    for i in range(torch.cuda.device_count())
)
if not is_ampere_or_newer_gpu:
    print(" ## Warning: Flash Attention is installed but unsupported GPUs were detected.")

Paged attention assertion from `exllamav2/attn.py:84-91`:

def assert_paged_attn():
    global has_flash_attn_with_paged
    assert has_flash_attn_with_paged, \
        "Paged attention required Flash Attention 2.5.7 or later"

Model compatibility override from `exllamav2/config.py:629-676`:

def arch_compat_overrides(self, quiet: bool = False, warn_only = False):
    if self.attn_logit_softcapping and not has_flash_attn_with_softcap:
        warnings.append(" !! Warning: model requires softcap, not supported in installed version of flash-attn")
    if (self.arch.lm.swa or self.arch.lm.alternating_swa) and not has_flash_attn_with_window:
        warnings.append(" !! Warning: model requires SWA, not supported in installed version of flash-attn")

Common Errors

Error Message | Cause | Solution
`Paged attention required Flash Attention 2.5.7 or later` | flash-attn not installed, or version < 2.5.7 | `pip install "flash-attn>=2.5.7" --no-build-isolation`
`Warning: Flash Attention is installed but unsupported GPUs were detected` | GPU compute capability < 8.0 (pre-Ampere) | Use xformers instead, or upgrade to an Ampere+ GPU
`Warning: model requires softcap, not supported in installed version of flash-attn` | Model needs logit softcapping, but the installed flash-attn lacks the `softcap` parameter | Upgrade flash-attn; otherwise softcapping is disabled
`Warning: model requires SWA, not supported in installed version of flash-attn` | Model uses sliding window attention, but the installed flash-attn lacks the `window_size` parameter | Upgrade flash-attn; otherwise SWA is disabled

Compatibility Notes

  • Pre-Ampere GPUs (SM < 8.0): Flash Attention cannot be used. Install xformers as the attention backend instead.
  • ROCm (AMD): Flash Attention has limited ROCm support. The SDPA backend is noted as "unreliable on ROCm" in the source. xformers may be the best option for AMD GPUs.
  • Sliding Window Attention: Not supported in tensor-parallel mode regardless of Flash Attention version.
  • Without Flash Attention: The dynamic generator operates in unpaged mode with max batch_size=1, no CFG support, and no prefix caching. This significantly limits throughput.
