
Environment:Huggingface Diffusers Attention Backends

From Leeroopedia
Knowledge Sources
Domains: Optimization, Infrastructure
Last Updated: 2026-02-13 21:00 GMT

Overview

Optional attention backend environment for Diffusers: Flash Attention >= 2.6.3, xFormers >= 0.0.29, SageAttention >= 2.1.1, and other accelerated attention implementations.

Description

Diffusers supports multiple attention backend implementations selected via the `DIFFUSERS_ATTN_BACKEND` environment variable (default: `"native"`). The native backend uses PyTorch's `F.scaled_dot_product_attention` (requires PyTorch >= 2.0). Optional accelerated backends provide better performance for specific hardware. Flash Attention v2 and v3 target NVIDIA GPUs, SageAttention provides INT8/FP8 quantized attention, and xFormers offers memory-efficient attention. The backend dispatch is centralized in `attention_dispatch.py`, which checks availability and version requirements at import time.
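
Because the availability checks run at import time, the environment variable must be set before `diffusers` is imported. A minimal sketch of that ordering (the value `"native"` is the documented default; other accepted value strings may vary between Diffusers versions):

```python
import os

# Choose the backend BEFORE importing diffusers: attention_dispatch.py
# evaluates its availability flags when the module is first imported.
os.environ.setdefault("DIFFUSERS_ATTN_BACKEND", "native")

# Mirrors the getenv pattern shown in constants.py below.
backend = os.getenv("DIFFUSERS_ATTN_BACKEND", "native")
print(backend)
```

Setting the variable in the shell (as in Quick Install below) achieves the same thing, as long as it happens before the Python process starts.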

Usage

Use these backends when you need faster inference or lower-memory attention computation. Flash Attention delivers the best performance on Ampere and newer NVIDIA GPUs. xFormers is an alternative for older GPU architectures such as Volta and Turing. SageAttention offers INT8/FP8 quantized attention for the largest memory savings.
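
The guidance above can be summarized as a small decision helper. This is an illustrative heuristic based on this page's compatibility notes, not a Diffusers API; the function name and the mapping from compute capability to backend are assumptions:

```python
def suggest_backend(sm_major: int) -> str:
    """Map an NVIDIA compute capability major version to a backend hint.

    Illustrative only: mirrors this page's guidance, not Diffusers logic.
    """
    if sm_major >= 8:   # Ampere/Hopper (SM80+): Flash Attention v2/v3
        return "flash_attn"
    if sm_major >= 7:   # Volta/Turing: xFormers memory-efficient attention
        return "xformers"
    return "native"     # PyTorch SDPA works everywhere as a fallback

print(suggest_backend(9))  # Hopper (e.g. H100)
```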

System Requirements

Category | Requirement | Notes
Hardware | NVIDIA GPU (Ampere+ for Flash Attention) | SM80+ for Flash Attn; SM90 for FP8 SageAttention
PyTorch | >= 2.0 for native SDPA; >= 2.5.0 for flex_attention | flex_attention is the newest backend

Dependencies

Attention Backend Packages

Backend | Package | Min Version | Notes
Flash Attention v2 | `flash_attn` | >= 2.6.3 | NVIDIA Ampere+ GPUs (A100, H100, RTX 3090+)
Flash Attention v3 | `flash_attn_3` | (latest) | Hopper GPUs (H100)
AITER | `aiter` | >= 0.1.5 | AMD GPU flash attention
SageAttention | `sageattention` | >= 2.1.1 | INT8/FP8 quantized attention
Flex Attention | (PyTorch built-in) | torch >= 2.5.0 | PyTorch native flex attention
xFormers | `xformers` | >= 0.0.29 | Memory-efficient attention
XLA Attention | `torch_xla` | >= 2.2 | TPU/XLA flash attention
NPU Attention | `torch_npu` | (latest) | Huawei NPU fusion attention

Credentials

No additional credentials required.

Quick Install

# Flash Attention (recommended for NVIDIA GPUs)
pip install flash-attn --no-build-isolation

# xFormers (alternative memory-efficient attention)
pip install xformers

# SageAttention (quantized attention)
pip install sageattention

# Set attention backend via environment variable
export DIFFUSERS_ATTN_BACKEND=flash_attn  # or: native, xformers, sage_attn, flex_attn

Code Evidence

Version requirements and availability checks from `attention_dispatch.py:58-72`:

_REQUIRED_FLASH_VERSION = "2.6.3"
_REQUIRED_AITER_VERSION = "0.1.5"
_REQUIRED_SAGE_VERSION = "2.1.1"
_REQUIRED_FLEX_VERSION = "2.5.0"
_REQUIRED_XLA_VERSION = "2.2"
_REQUIRED_XFORMERS_VERSION = "0.0.29"

_CAN_USE_FLASH_ATTN = is_flash_attn_available() and is_flash_attn_version(">=", _REQUIRED_FLASH_VERSION)
_CAN_USE_FLASH_ATTN_3 = is_flash_attn_3_available()
_CAN_USE_AITER_ATTN = is_aiter_available() and is_aiter_version(">=", _REQUIRED_AITER_VERSION)
_CAN_USE_SAGE_ATTN = is_sageattention_available() and is_sageattention_version(">=", _REQUIRED_SAGE_VERSION)
_CAN_USE_FLEX_ATTN = is_torch_version(">=", _REQUIRED_FLEX_VERSION)
_CAN_USE_NPU_ATTN = is_torch_npu_available()
_CAN_USE_XLA_ATTN = is_torch_xla_available() and is_torch_xla_version(">=", _REQUIRED_XLA_VERSION)
_CAN_USE_XFORMERS_ATTN = is_xformers_available() and is_xformers_version(">=", _REQUIRED_XFORMERS_VERSION)

Backend selection via environment variable from `constants.py:44`:

DIFFUSERS_ATTN_BACKEND = os.getenv("DIFFUSERS_ATTN_BACKEND", "native")

Attention constraint checking from `attention_dispatch.py:411-440`:

# Constraint functions applied to all attention backends (bodies abridged)
def _check_device(): ...                   # verify tensors are on the expected device
def _check_qkv_dtype_bf16_or_fp16(): ...   # Flash/Sage require bf16 or fp16
def _check_shape(): ...                    # query must be 4D
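
A torch-free sketch of what such constraint checks might look like. The real functions in `attention_dispatch.py` operate on torch tensors and compare `torch.dtype` objects; the string dtypes and `SimpleNamespace` stand-ins here are purely illustrative:

```python
from types import SimpleNamespace

_ALLOWED_DTYPES = {"bfloat16", "float16"}  # Flash/Sage accept bf16 or fp16 only

def check_qkv_dtype(query, key, value) -> None:
    # Illustrative re-creation of the dtype constraint.
    for t in (query, key, value):
        if t.dtype not in _ALLOWED_DTYPES:
            raise ValueError(f"Flash attention requires bf16 or fp16, got {t.dtype}")

def check_shape(query) -> None:
    # Query layout is assumed (batch, heads, seq_len, head_dim).
    if query.ndim != 4:
        raise ValueError(f"query must be 4D, got {query.ndim}D")

q = SimpleNamespace(dtype="float16", ndim=4)
check_qkv_dtype(q, q, q)  # passes
check_shape(q)            # passes
```

A float32 input would trip the first check, which is exactly the failure mode listed under Common Errors below.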

Common Errors

Error Message | Cause | Solution
`ImportError: flash_attn not found` | Flash Attention not installed | `pip install flash-attn --no-build-isolation`
`Flash attention requires bf16 or fp16 dtype` | Input tensors in float32 | Cast model to half precision: `pipe.to(torch.float16)`
`torch_npu is not available` | NPU attention requested without torch_npu | Install torch_npu for Huawei NPU hardware
`torch_xla is not available` | XLA attention requested without torch_xla | Install torch_xla for TPU/XLA hardware

Compatibility Notes

  • Native SDPA: Default backend. Works on all devices. Requires PyTorch >= 2.0.
  • Flash Attention: Best performance on Ampere+ (A100, H100, RTX 3090/4090). FP16/BF16 only.
  • xFormers: Good alternative for older GPUs (Volta, Turing). Device and dtype checks applied.
  • SageAttention: FP8 mode requires SM90 (Hopper). INT8 mode works on Ampere+.
  • Flex Attention: PyTorch native, supports compile. Requires PyTorch >= 2.5.0.
  • DIFFUSERS_ATTN_CHECKS: Set to `"1"` to enable runtime attention constraint validation.
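
The opt-in validation toggle in the last bullet can be sketched as follows; the exact truthiness parsing Diffusers applies to the variable is an assumption here, only the variable name and the `"1"` value come from this page:

```python
import os

# Enable runtime attention constraint validation (opt-in, off by default).
os.environ["DIFFUSERS_ATTN_CHECKS"] = "1"

# Assumed parsing for illustration: treat exactly "1" as enabled.
checks_enabled = os.getenv("DIFFUSERS_ATTN_CHECKS", "0") == "1"
print(checks_enabled)
```

The checks add per-call overhead, so they are best reserved for debugging dtype or shape errors rather than production inference.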
