
Environment:Huggingface Diffusers Attention Backends

From Leeroopedia
Knowledge Sources
Domains: Optimization, Infrastructure
Last Updated: 2026-02-13 21:00 GMT

Overview

Optional attention backend environment for Diffusers: Flash Attention >= 2.6.3, xFormers >= 0.0.29, SageAttention >= 2.1.1, and other accelerated attention implementations.

Description

Diffusers supports multiple attention backend implementations selected via the `DIFFUSERS_ATTN_BACKEND` environment variable (default: `"native"`). The native backend uses PyTorch's `F.scaled_dot_product_attention` (requires PyTorch >= 2.0). Optional accelerated backends provide better performance for specific hardware. Flash Attention v2 and v3 target NVIDIA GPUs, SageAttention provides INT8/FP8 quantized attention, and xFormers offers memory-efficient attention. The backend dispatch is centralized in `attention_dispatch.py`, which checks availability and version requirements at import time.
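
Because the availability checks run at import time, the environment variable must be set before `diffusers` is imported. A minimal sketch of that ordering (the value `"native"` is the documented default; other accepted value strings may vary between Diffusers versions):

```python
import os

# Choose the backend BEFORE importing diffusers: attention_dispatch.py
# evaluates its availability flags when the module is first imported.
os.environ.setdefault("DIFFUSERS_ATTN_BACKEND", "native")

# Mirrors the getenv pattern shown in constants.py below.
backend = os.getenv("DIFFUSERS_ATTN_BACKEND", "native")
print(backend)
```

Setting the variable in the shell (as in Quick Install below) achieves the same thing, as long as it happens before the Python process starts.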

Usage

Use these backends when you need faster inference or lower-memory attention computation. Flash Attention delivers the best performance on Ampere and newer NVIDIA GPUs. xFormers is an alternative for older GPU architectures such as Volta and Turing. SageAttention offers INT8/FP8 quantized attention for the largest memory savings.
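
The guidance above can be summarized as a small decision helper. This is an illustrative heuristic based on this page's compatibility notes, not a Diffusers API; the function name and the mapping from compute capability to backend are assumptions:

```python
def suggest_backend(sm_major: int) -> str:
    """Map an NVIDIA compute capability major version to a backend hint.

    Illustrative only: mirrors this page's guidance, not Diffusers logic.
    """
    if sm_major >= 8:   # Ampere/Hopper (SM80+): Flash Attention v2/v3
        return "flash_attn"
    if sm_major >= 7:   # Volta/Turing: xFormers memory-efficient attention
        return "xformers"
    return "native"     # PyTorch SDPA works everywhere as a fallback

print(suggest_backend(9))  # Hopper (e.g. H100)
```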

System Requirements

Category | Requirement | Notes
Hardware | NVIDIA GPU (Ampere+ for Flash Attention) | SM80+ for Flash Attn; SM90 for FP8 SageAttention
PyTorch | >= 2.0 for native SDPA; >= 2.5.0 for flex_attention | flex_attention is the newest backend

Dependencies

Attention Backend Packages

Backend | Package | Min Version | Notes
Flash Attention v2 | `flash_attn` | >= 2.6.3 | NVIDIA Ampere+ GPUs (A100, H100, RTX 3090+)
Flash Attention v3 | `flash_attn_3` | (latest) | Hopper GPUs (H100)
AITER | `aiter` | >= 0.1.5 | AMD GPU flash attention
SageAttention | `sageattention` | >= 2.1.1 | INT8/FP8 quantized attention
Flex Attention | (PyTorch built-in) | torch >= 2.5.0 | PyTorch native flex attention
xFormers | `xformers` | >= 0.0.29 | Memory-efficient attention
XLA Attention | `torch_xla` | >= 2.2 | TPU/XLA flash attention
NPU Attention | `torch_npu` | (latest) | Huawei NPU fusion attention

Credentials

No additional credentials required.

Quick Install

# Flash Attention (recommended for NVIDIA GPUs)
pip install flash-attn --no-build-isolation

# xFormers (alternative memory-efficient attention)
pip install xformers

# SageAttention (quantized attention)
pip install sageattention

# Set attention backend via environment variable
export DIFFUSERS_ATTN_BACKEND=flash_attn  # or: native, xformers, sage_attn, flex_attn

Code Evidence

Version requirements and availability checks from `attention_dispatch.py:58-72`:

_REQUIRED_FLASH_VERSION = "2.6.3"
_REQUIRED_AITER_VERSION = "0.1.5"
_REQUIRED_SAGE_VERSION = "2.1.1"
_REQUIRED_FLEX_VERSION = "2.5.0"
_REQUIRED_XLA_VERSION = "2.2"
_REQUIRED_XFORMERS_VERSION = "0.0.29"

_CAN_USE_FLASH_ATTN = is_flash_attn_available() and is_flash_attn_version(">=", _REQUIRED_FLASH_VERSION)
_CAN_USE_FLASH_ATTN_3 = is_flash_attn_3_available()
_CAN_USE_AITER_ATTN = is_aiter_available() and is_aiter_version(">=", _REQUIRED_AITER_VERSION)
_CAN_USE_SAGE_ATTN = is_sageattention_available() and is_sageattention_version(">=", _REQUIRED_SAGE_VERSION)
_CAN_USE_FLEX_ATTN = is_torch_version(">=", _REQUIRED_FLEX_VERSION)
_CAN_USE_NPU_ATTN = is_torch_npu_available()
_CAN_USE_XLA_ATTN = is_torch_xla_available() and is_torch_xla_version(">=", _REQUIRED_XLA_VERSION)
_CAN_USE_XFORMERS_ATTN = is_xformers_available() and is_xformers_version(">=", _REQUIRED_XFORMERS_VERSION)

Backend selection via environment variable from `constants.py:44`:

DIFFUSERS_ATTN_BACKEND = os.getenv("DIFFUSERS_ATTN_BACKEND", "native")

Attention constraint checking from `attention_dispatch.py:411-440`:

# Constraint functions applied to all attention backends (bodies abridged)
def _check_device(): ...                   # verify tensors are on the expected device
def _check_qkv_dtype_bf16_or_fp16(): ...   # Flash/Sage require bf16 or fp16
def _check_shape(): ...                    # query must be 4D
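
A torch-free sketch of what such constraint checks might look like. The real functions in `attention_dispatch.py` operate on torch tensors and compare `torch.dtype` objects; the string dtypes and `SimpleNamespace` stand-ins here are purely illustrative:

```python
from types import SimpleNamespace

_ALLOWED_DTYPES = {"bfloat16", "float16"}  # Flash/Sage accept bf16 or fp16 only

def check_qkv_dtype(query, key, value) -> None:
    # Illustrative re-creation of the dtype constraint.
    for t in (query, key, value):
        if t.dtype not in _ALLOWED_DTYPES:
            raise ValueError(f"Flash attention requires bf16 or fp16, got {t.dtype}")

def check_shape(query) -> None:
    # Query layout is assumed (batch, heads, seq_len, head_dim).
    if query.ndim != 4:
        raise ValueError(f"query must be 4D, got {query.ndim}D")

q = SimpleNamespace(dtype="float16", ndim=4)
check_qkv_dtype(q, q, q)  # passes
check_shape(q)            # passes
```

A float32 input would trip the first check, which is exactly the failure mode listed under Common Errors below.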

Common Errors

Error Message | Cause | Solution
`ImportError: flash_attn not found` | Flash Attention not installed | `pip install flash-attn --no-build-isolation`
`Flash attention requires bf16 or fp16 dtype` | Input tensors in float32 | Cast model to half precision: `pipe.to(torch.float16)`
`torch_npu is not available` | NPU attention requested without torch_npu | Install torch_npu for Huawei NPU hardware
`torch_xla is not available` | XLA attention requested without torch_xla | Install torch_xla for TPU/XLA hardware

Compatibility Notes

  • Native SDPA: Default backend. Works on all devices. Requires PyTorch >= 2.0.
  • Flash Attention: Best performance on Ampere+ (A100, H100, RTX 3090/4090). FP16/BF16 only.
  • xFormers: Good alternative for older GPUs (Volta, Turing). Device and dtype checks applied.
  • SageAttention: FP8 mode requires SM90 (Hopper). INT8 mode works on Ampere+.
  • Flex Attention: PyTorch native, supports compile. Requires PyTorch >= 2.5.0.
  • DIFFUSERS_ATTN_CHECKS: Set to `"1"` to enable runtime attention constraint validation.
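
The opt-in validation toggle in the last bullet can be sketched as follows; the exact truthiness parsing Diffusers applies to the variable is an assumption here, only the variable name and the `"1"` value come from this page:

```python
import os

# Enable runtime attention constraint validation (opt-in, off by default).
os.environ["DIFFUSERS_ATTN_CHECKS"] = "1"

# Assumed parsing for illustration: treat exactly "1" as enabled.
checks_enabled = os.getenv("DIFFUSERS_ATTN_CHECKS", "0") == "1"
print(checks_enabled)
```

The checks add per-call overhead, so they are best reserved for debugging dtype or shape errors rather than production inference.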
