Environment: Hugging Face Diffusers Attention Backends
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Optional attention backend environment for Diffusers: Flash Attention >= 2.6.3, xFormers >= 0.0.29, SageAttention >= 2.1.1, and other accelerated attention implementations.
Description
Diffusers supports multiple attention backend implementations selected via the `DIFFUSERS_ATTN_BACKEND` environment variable (default: `"native"`). The native backend uses PyTorch's `F.scaled_dot_product_attention` (requires PyTorch >= 2.0). Optional accelerated backends provide better performance for specific hardware. Flash Attention v2 and v3 target NVIDIA GPUs, SageAttention provides INT8/FP8 quantized attention, and xFormers offers memory-efficient attention. The backend dispatch is centralized in `attention_dispatch.py`, which checks availability and version requirements at import time.
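The dispatch pattern described above can be sketched as a registry keyed by the environment variable. Everything below except the `DIFFUSERS_ATTN_BACKEND` name is illustrative, not the actual `attention_dispatch.py` internals:

```python
import os

# Hypothetical registry: backends register themselves only if their
# import-time availability check passed.
_BACKENDS = {}

def register_backend(name, available):
    """Register an attention backend only when its package check passed."""
    def decorator(fn):
        if available:
            _BACKENDS[name] = fn
        return fn
    return decorator

@register_backend("native", available=True)  # native SDPA is always present
def native_attention(query, key, value):
    # The real native backend calls F.scaled_dot_product_attention (PyTorch >= 2.0).
    return "native"

def dispatch_attention(query=None, key=None, value=None):
    """Look up the backend named by the environment variable (default: native)."""
    name = os.getenv("DIFFUSERS_ATTN_BACKEND", "native")
    if name not in _BACKENDS:
        raise ValueError(f"Attention backend {name!r} is not available")
    return _BACKENDS[name](query, key, value)
```

Requesting a backend whose package failed its availability check then fails fast with a clear error rather than at kernel launch time.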
Usage
Use this environment when you need faster or lower-memory attention computation during inference. Flash Attention provides the best performance on Ampere and newer NVIDIA GPUs. xFormers is an alternative for older GPU architectures. SageAttention offers quantized attention for extreme memory savings.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU (Ampere+ for Flash Attention) | SM80+ for Flash Attn; SM90 for FP8 SageAttention |
| PyTorch | >= 2.0 for native SDPA; >= 2.5.0 for flex_attention | flex_attention is the newest backend |
Dependencies
Attention Backend Packages
| Backend | Package | Min Version | Notes |
|---|---|---|---|
| Flash Attention v2 | `flash_attn` | >= 2.6.3 | NVIDIA Ampere+ GPUs (A100, H100, RTX 3090+) |
| Flash Attention v3 | `flash_attn_3` | (latest) | Hopper GPUs (H100) |
| AITER | `aiter` | >= 0.1.5 | AMD GPU flash attention |
| SageAttention | `sageattention` | >= 2.1.1 | INT8/FP8 quantized attention |
| Flex Attention | (PyTorch built-in) | torch >= 2.5.0 | PyTorch native flex attention |
| xFormers | `xformers` | >= 0.0.29 | Memory-efficient attention |
| XLA Attention | `torch_xla` | >= 2.2 | TPU/XLA flash attention |
| NPU Attention | `torch_npu` | (latest) | Huawei NPU fusion attention |
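The minimum-version gating in the table above can be reproduced with the standard library. This is a simplified sketch; `parse_version` and `meets_min_version` are illustrative names, and the real Diffusers helpers rely on fuller PEP 440 version handling:

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(v):
    """Turn '2.6.3' into (2, 6, 3); non-numeric suffixes are dropped."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def meets_min_version(package, minimum):
    """True if `package` is installed at or above `minimum`, else False."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False  # package not installed at all
    return parse_version(installed) >= parse_version(minimum)
```

For example, `meets_min_version("flash_attn", "2.6.3")` mirrors the `_CAN_USE_FLASH_ATTN` check shown under Code Evidence.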
Credentials
No additional credentials required.
Quick Install
# Flash Attention (recommended for NVIDIA GPUs)
pip install flash-attn --no-build-isolation
# xFormers (alternative memory-efficient attention)
pip install xformers
# SageAttention (quantized attention)
pip install sageattention
# Set attention backend via environment variable
export DIFFUSERS_ATTN_BACKEND=flash_attn # or: native, xformers, sage_attn, flex_attn
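Because `constants.py` reads the variable once at import time (see Code Evidence), setting it programmatically only works if done before `diffusers` is imported:

```python
import os

# The backend constant is resolved when diffusers is first imported,
# so the assignment must come before the import, not after.
os.environ["DIFFUSERS_ATTN_BACKEND"] = "flash_attn"

# import diffusers  # must follow the assignment above
```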
Code Evidence
Version requirements and availability checks from `attention_dispatch.py:58-72`:
_REQUIRED_FLASH_VERSION = "2.6.3"
_REQUIRED_AITER_VERSION = "0.1.5"
_REQUIRED_SAGE_VERSION = "2.1.1"
_REQUIRED_FLEX_VERSION = "2.5.0"
_REQUIRED_XLA_VERSION = "2.2"
_REQUIRED_XFORMERS_VERSION = "0.0.29"
_CAN_USE_FLASH_ATTN = is_flash_attn_available() and is_flash_attn_version(">=", _REQUIRED_FLASH_VERSION)
_CAN_USE_FLASH_ATTN_3 = is_flash_attn_3_available()
_CAN_USE_AITER_ATTN = is_aiter_available() and is_aiter_version(">=", _REQUIRED_AITER_VERSION)
_CAN_USE_SAGE_ATTN = is_sageattention_available() and is_sageattention_version(">=", _REQUIRED_SAGE_VERSION)
_CAN_USE_FLEX_ATTN = is_torch_version(">=", _REQUIRED_FLEX_VERSION)
_CAN_USE_NPU_ATTN = is_torch_npu_available()
_CAN_USE_XLA_ATTN = is_torch_xla_available() and is_torch_xla_version(">=", _REQUIRED_XLA_VERSION)
_CAN_USE_XFORMERS_ATTN = is_xformers_available() and is_xformers_version(">=", _REQUIRED_XFORMERS_VERSION)
Backend selection via environment variable from `constants.py:44`:
DIFFUSERS_ATTN_BACKEND = os.getenv("DIFFUSERS_ATTN_BACKEND", "native")
Attention constraint checking from `attention_dispatch.py:411-440`:
# Constraint functions applied to all attention backends
def _check_device(...):                    # verify tensors are on the expected device
def _check_qkv_dtype_bf16_or_fp16(...):    # Flash/Sage require bf16 or fp16 inputs
def _check_shape(...):                     # query must be a 4D tensor
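The three constraints can be illustrated with a stand-in tensor record so the logic runs without torch; the real functions in `attention_dispatch.py` operate on `torch.Tensor` attributes, and `TensorStub` is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class TensorStub:
    """Minimal stand-in carrying the attributes the checks inspect."""
    shape: tuple
    dtype: str
    device: str

def check_device(query, key, value):
    """All three tensors must live on the same device."""
    if not (query.device == key.device == value.device):
        raise ValueError("query, key and value must be on the same device")

def check_qkv_dtype_bf16_or_fp16(query, key, value):
    """Flash Attention and SageAttention accept only half-precision inputs."""
    allowed = {"bfloat16", "float16"}
    if not {query.dtype, key.dtype, value.dtype} <= allowed:
        raise ValueError("Flash/Sage attention requires bf16 or fp16 tensors")

def check_shape(query):
    """Dispatch expects a 4D query: (batch, heads, seq_len, head_dim)."""
    if len(query.shape) != 4:
        raise ValueError("query must be a 4D tensor")
```

A float32 query fails the dtype check, which is the source of the "requires bf16 or fp16" error listed under Common Errors.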
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: flash_attn not found` | Flash Attention not installed | `pip install flash-attn --no-build-isolation` |
| `Flash attention requires bf16 or fp16 dtype` | Input tensors in float32 | Cast model to half precision: `pipe.to(torch.float16)` |
| `torch_npu is not available` | NPU attention requested without torch_npu | Install torch_npu for Huawei NPU hardware |
| `torch_xla is not available` | XLA attention requested without torch_xla | Install torch_xla for TPU/XLA hardware |
Compatibility Notes
- Native SDPA: Default backend. Works on all devices. Requires PyTorch >= 2.0.
- Flash Attention: Best performance on Ampere+ (A100, H100, RTX 3090/4090). FP16/BF16 only.
- xFormers: Good alternative for older GPUs (Volta, Turing). Device and dtype checks applied.
- SageAttention: FP8 mode requires SM90 (Hopper). INT8 mode works on Ampere+.
- Flex Attention: PyTorch native, supports compile. Requires PyTorch >= 2.5.0.
- DIFFUSERS_ATTN_CHECKS: Set to `"1"` to enable runtime attention constraint validation.
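The hardware guidance above can be condensed into a small helper that maps a CUDA compute capability (e.g. the tuple returned by `torch.cuda.get_device_capability()`) to a backend name. This is a heuristic sketch, not part of the Diffusers API:

```python
def recommend_backend(capability, want_fp8=False):
    """Pick an attention backend from a (major, minor) compute capability.

    Follows the Compatibility Notes: SM90 for FP8 SageAttention, SM80+
    for Flash Attention, SM70+ for xFormers, native SDPA otherwise.
    """
    if want_fp8:
        # FP8 SageAttention needs Hopper (SM90); fall back to native otherwise.
        return "sage_attn" if capability >= (9, 0) else "native"
    if capability >= (8, 0):   # Ampere and newer
        return "flash_attn"
    if capability >= (7, 0):   # Volta / Turing
        return "xformers"
    return "native"
```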