Environment: turboderp-org/exllamav2 Flash Attention Backend
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Inference_Optimization, GPU_Computing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Optional Flash Attention 2.2.1+ dependency for high-performance attention computation, with version 2.5.7+ required for paged attention in the dynamic generator.
Description
Flash Attention is an optional but strongly recommended dependency that provides memory-efficient attention kernels. ExLlamaV2 detects Flash Attention at import time and enables progressively more features based on the installed version:
- `flash_attn >= 2.2.1`: Basic flash attention via `flash_attn_func` and `flash_attn_varlen_func`
- `flash_attn >= 2.5.7`: Paged attention via `flash_attn_with_kvcache` and the low-level `flash_attn_2_cuda` interface
- Sliding window support: detected by inspecting the `flash_attn_func` signature for a `window_size` parameter
- Softcap support: detected by inspecting the `flash_attn_func` signature for a `softcap` parameter
Without Flash Attention, ExLlamaV2 falls back to xformers, Torch SDPA, or manual matmul attention. Flash Attention requires Ampere or newer GPUs (compute capability >= 8.0).
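The version gating described above can be sketched as a small helper (a minimal illustrative sketch; `feature_flags` and its return keys are not part of ExLlamaV2's API):

```python
def feature_flags(version: str) -> dict:
    """Map a flash-attn version string to the feature tiers that
    ExLlamaV2-style version gating would enable."""
    # Parse "2.5.7" -> [2, 5, 7], skipping non-numeric suffixes like "post1".
    ver = [int(t) for t in version.split(".") if t.isdigit()]
    return {
        "basic": ver >= [2, 2, 1],  # flash_attn_func / flash_attn_varlen_func
        "paged": ver >= [2, 5, 7],  # flash_attn_with_kvcache + flash_attn_2_cuda
    }

print(feature_flags("2.2.1"))  # basic only
print(feature_flags("2.5.7"))  # basic + paged
```

List comparison gives the usual lexicographic ordering, which is why the real code compares parsed version lists rather than raw strings.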
Usage
Install Flash Attention for all GPU-accelerated inference workflows. It is mandatory for using `ExLlamaV2DynamicGenerator` in paged mode (the recommended generator). Without it, the dynamic generator falls back to unpaged mode with `batch_size = 1` and reduced functionality.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with compute capability >= 8.0 (Ampere+) | RTX 3000 series, A100, A6000, RTX 4000 series, H100, etc. |
| CUDA | CUDA toolkit matching PyTorch build | Required for flash-attn compilation |
| OS | Linux or Windows | Pre-built wheels primarily for Linux |
Dependencies
Python Packages
- `flash-attn` >= 2.2.1 (basic), >= 2.5.7 (paged attention, recommended)
- `flash_attn_2_cuda` (bundled with flash-attn >= 2.5.7, provides low-level paged API)
Alternative Backends
If Flash Attention is not installed, ExLlamaV2 uses these fallbacks in priority order:
- `xformers` (requires `xformers.ops` with `LowerTriangularFromBottomRightMask` support)
- Torch SDPA (requires PyTorch 2.4.0+ for `causal_lower_right` bias)
- Manual matmul attention (always available, highest memory usage)
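The fallback order can be sketched as a simple priority scan (illustrative only; the flag names mirror the `has_*` globals in `exllamav2/attn.py`, but this function is not part of the library):

```python
def pick_backend(has_flash_attn: bool, has_xformers: bool, has_sdpa: bool) -> str:
    """Return the first available attention backend in priority order;
    manual matmul attention is the always-available last resort."""
    if has_flash_attn:
        return "flash_attn"
    if has_xformers:
        return "xformers"
    if has_sdpa:
        return "sdpa"
    return "matmul"

print(pick_backend(False, True, True))  # -> xformers
```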
Credentials
No credentials required.
The following environment variables can override attention backend selection:
- `EXLLAMA_NO_FLASH_ATTN`: Set to any value to disable Flash Attention import
- `EXLLAMA_NO_XFORMERS`: Set to any value to disable xformers import
- `EXLLAMA_NO_SDPA`: Set to any value to disable Torch SDPA
- `EXLLAMA_NO_GRAPHS`: Set to any value to disable CUDA graphs
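Note that these variables are checked for presence, not value: even `EXLLAMA_NO_FLASH_ATTN=0` disables the import. A minimal sketch of the same pattern (the helper function is illustrative, not part of the library):

```python
import os

def flash_attn_allowed(env=os.environ) -> bool:
    """Mirror ExLlamaV2's check: any value, even "0" or "", disables it."""
    return "EXLLAMA_NO_FLASH_ATTN" not in env

print(flash_attn_allowed({}))                              # True
print(flash_attn_allowed({"EXLLAMA_NO_FLASH_ATTN": "0"}))  # False
```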
Quick Install
```shell
# Install Flash Attention (requires CUDA and an Ampere+ GPU)
pip install flash-attn --no-build-isolation

# xformers as an alternative (also works on pre-Ampere GPUs)
pip install xformers
```
Code Evidence
Flash Attention version detection from `exllamav2/attn.py:26-59`:
```python
import os  # imported at the top of attn.py

has_flash_attn = False
has_flash_attn_with_paged = False
has_flash_attn_with_window = False
has_flash_attn_with_softcap = False

if 'EXLLAMA_NO_FLASH_ATTN' not in os.environ:
    try:
        import flash_attn
        flash_attn_ver = [int(t) for t in flash_attn.__version__.split(".") if t.isdigit()]
        if [2, 2, 1] <= flash_attn_ver < [2, 5, 7]:
            from flash_attn import flash_attn_func, flash_attn_varlen_func
            has_flash_attn = True
        if [2, 5, 7] <= flash_attn_ver:
            from flash_attn import flash_attn_func, flash_attn_varlen_func, flash_attn_with_kvcache
            import flash_attn_2_cuda as flash_attn_cuda
            has_flash_attn = True
            has_flash_attn_with_paged = True
    except ModuleNotFoundError:
        pass
```
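The window and softcap detection mentioned earlier works by inspecting the signature of `flash_attn_func`. The same pattern, sketched against a stand-in function (`fake_flash_attn_func` is hypothetical, standing in for the real import):

```python
import inspect

# Stand-in mimicking flash_attn_func's signature in a newer release.
def fake_flash_attn_func(q, k, v, causal=False, window_size=(-1, -1), softcap=0.0):
    ...

params = inspect.signature(fake_flash_attn_func).parameters
has_window = "window_size" in params   # sliding-window attention supported
has_softcap = "softcap" in params      # logit softcapping supported

print(has_window, has_softcap)  # True True
```

Signature inspection avoids hard-coding yet another version threshold: the feature is enabled exactly when the installed kernel accepts the parameter.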
Ampere GPU requirement check from `exllamav2/attn.py:38-41`:
```python
import torch  # imported at the top of attn.py

is_ampere_or_newer_gpu = any(
    torch.cuda.get_device_properties(i).major >= 8
    for i in range(torch.cuda.device_count())
)
if not is_ampere_or_newer_gpu:
    print(" ## Warning: Flash Attention is installed but unsupported GPUs were detected.")
```
Paged attention assertion from `exllamav2/attn.py:84-91`:
```python
def assert_paged_attn():
    global has_flash_attn_with_paged
    assert has_flash_attn_with_paged, \
        "Paged attention required Flash Attention 2.5.7 or later"
```
Model compatibility override from `exllamav2/config.py:629-676`:
```python
def arch_compat_overrides(self, quiet: bool = False, warn_only = False):
    ...
    if self.attn_logit_softcapping and not has_flash_attn_with_softcap:
        warnings.append(" !! Warning: model requires softcap, not supported in installed version of flash-attn")
    if (self.arch.lm.swa or self.arch.lm.alternating_swa) and not has_flash_attn_with_window:
        warnings.append(" !! Warning: model requires SWA, not supported in installed version of flash-attn")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Paged attention required Flash Attention 2.5.7 or later` | flash-attn not installed or version < 2.5.7 | `pip install "flash-attn>=2.5.7" --no-build-isolation` (quote the spec so the shell does not treat `>=` as a redirect) |
| `Warning: Flash Attention is installed but unsupported GPUs were detected` | GPU compute capability < 8.0 (pre-Ampere) | Use xformers instead, or upgrade to Ampere+ GPU |
| `Warning: model requires softcap, not supported in installed version of flash-attn` | Model needs logit softcapping but flash-attn version lacks support | Upgrade flash-attn to latest version, or softcap will be disabled |
| `Warning: model requires SWA, not supported in installed version of flash-attn` | Model uses sliding window attention but flash-attn lacks window_size parameter | Upgrade flash-attn; SWA will be disabled otherwise |
Compatibility Notes
- Pre-Ampere GPUs (SM < 8.0): Flash Attention cannot be used. Install xformers as the attention backend instead.
- ROCm (AMD): Flash Attention has limited ROCm support. The SDPA backend is noted as "unreliable on ROCm" in the source. xformers may be the best option for AMD GPUs.
- Sliding Window Attention: Not supported in tensor-parallel mode regardless of Flash Attention version.
- Without Flash Attention: The dynamic generator operates in unpaged mode with max batch_size=1, no CFG support, and no prefix caching. This significantly limits throughput.