Environment: Hugging Face Transformers Flash Attention 2
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure, GPU |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Flash Attention 2 environment for memory-efficient and faster attention computation on CUDA/ROCm GPUs.
Description
Flash Attention 2 is an optimized attention implementation that reduces memory usage from O(N^2) to O(N) and provides significant speedups for long sequences. The Transformers library supports Flash Attention 2 as an optional backend via attn_implementation="flash_attention_2". Version requirements differ by platform: CUDA requires flash_attn >= 2.1.0, ROCm requires >= 2.0.4, and MLU requires >= 2.3.3. Flash Attention 3 is also supported as a separate package (flash_attn_3).
Usage
Use this environment when you need faster attention computation or longer sequence lengths during training or inference. Particularly beneficial for models with long context windows (4K+ tokens).
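A minimal sketch of enabling the backend at load time (the model id is a placeholder and device_map="auto" assumes accelerate is installed; attn_implementation="flash_attention_2" is the documented Transformers switch):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # placeholder; substitute your own model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Flash Attention only supports fp16/bf16
    attn_implementation="flash_attention_2",  # opt into the flash-attn backend
    device_map="auto",                        # place weights on the available GPU(s)
)

inputs = tokenizer("Flash Attention keeps attention memory linear in sequence length.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))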
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Compilation requires Linux |
| Hardware | NVIDIA GPU (Ampere+) or AMD GPU (ROCm) | Compute capability >= 8.0 recommended |
| VRAM | Same as base model | Flash Attention reduces attention memory usage; it adds no overhead |
| CUDA | 11.8+ | For NVIDIA GPUs |
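The Ampere-or-newer recommendation above can be verified at runtime; a small sketch using standard PyTorch calls (applies to NVIDIA GPUs; ROCm devices report capability differently):
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # Compute capability 8.0+ (Ampere, Hopper) is the recommended baseline.
    verdict = "OK for Flash Attention 2" if major >= 8 else "below the recommended 8.0"
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor} -> {verdict}")
else:
    print("No CUDA/ROCm device visible; Flash Attention is GPU-only.")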
Dependencies
System Packages
- NVIDIA CUDA Toolkit 11.8+ (for compilation)
- C++ compiler (gcc/g++)
Python Packages
- torch >= 2.4.0
- flash-attn >= 2.1.0 (for CUDA)
- flash-attn >= 2.0.4 (for ROCm)
- flash-attn >= 2.3.3 (for MLU)
- flash-attn-3 (optional, for Flash Attention 3)
Credentials
No credentials required.
Quick Install
# Install flash-attn (may require compilation)
pip install flash-attn --no-build-isolation
# For Flash Attention 3 (optional)
pip install flash-attn-3
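After installing, a quick sanity check (not a benchmark) is to read the package version and call the availability helper shown under Code Evidence:
import flash_attn
from transformers.utils import is_flash_attn_2_available

print("flash-attn version:", flash_attn.__version__)
print("usable from Transformers:", is_flash_attn_2_available())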
Code Evidence
Platform-aware version checking from src/transformers/utils/import_utils.py:853-871:
@lru_cache
def is_flash_attn_2_available() -> bool:
    is_available, flash_attn_version = _is_package_available("flash_attn", return_version=True)
    if not is_available or not (is_torch_cuda_available() or is_torch_mlu_available()):
        return False

    import torch

    try:
        if torch.version.cuda:
            return version.parse(flash_attn_version) >= version.parse("2.1.0")
        elif torch.version.hip:
            return version.parse(flash_attn_version) >= version.parse("2.0.4")
        elif is_torch_mlu_available():
            return version.parse(flash_attn_version) >= version.parse("2.3.3")
        else:
            return False
    except packaging.version.InvalidVersion:
        return False
Flash Attention 3 check from src/transformers/utils/import_utils.py:875-876:
@lru_cache
def is_flash_attn_3_available() -> bool:
    return is_torch_cuda_available() and _is_package_available("flash_attn_3")
Deterministic mode support from src/transformers/trainer_utils.py:171:
os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| ImportError: flash_attn | flash-attn not installed | pip install flash-attn --no-build-isolation |
| flash_attn requires CUDA | No CUDA GPU available | Flash Attention requires an NVIDIA or AMD GPU |
| Compilation error during install | Missing build tools | Install gcc/g++ and CUDA toolkit headers |
| FlashAttention only supports fp16/bf16 | Float32 input tensors | Cast the model to fp16 or bf16 before using Flash Attention |
Compatibility Notes
- NVIDIA CUDA: Requires flash_attn >= 2.1.0. Best performance on Ampere (A100) and Hopper (H100) GPUs.
- AMD ROCm: Requires flash_attn >= 2.0.4 from the ROCm-specific fork.
- Cambricon MLU: Requires flash_attn >= 2.3.3.
- Intel XPU: Not supported by Flash Attention; use native SDPA instead.
- Apple MPS: Not supported; use native SDPA instead.
- CPU: Not supported; Flash Attention is GPU-only.
- Data Types: Only supports fp16 and bf16 inputs (not float32).
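Given these platform limits, a defensive pattern is to probe availability and fall back to PyTorch SDPA; a minimal sketch (the model id is a placeholder, and "sdpa" is the stock Transformers backend):
import torch
from transformers import AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# Use Flash Attention 2 when the package and a supported GPU are present,
# otherwise fall back to torch.nn.functional.scaled_dot_product_attention.
attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # placeholder model id
    torch_dtype=torch.bfloat16 if attn_impl == "flash_attention_2" else None,
    attn_implementation=attn_impl,
)
print("Using attention implementation:", attn_impl)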