
Environment:Huggingface Transformers Flash Attention 2 Env

From Leeroopedia
Domains: Optimization, Infrastructure, GPU
Last Updated: 2026-02-13 20:00 GMT

Overview

Flash Attention 2 environment for memory-efficient and faster attention computation on CUDA/ROCm GPUs.

Description

Flash Attention 2 is an optimized attention implementation that reduces memory usage from O(N^2) to O(N) and provides significant speedups for long sequences. The Transformers library supports Flash Attention 2 as an optional backend via attn_implementation="flash_attention_2". Version requirements differ by platform: CUDA requires flash_attn >= 2.1.0, ROCm requires >= 2.0.4, and MLU requires >= 2.3.3. Flash Attention 3 is also supported as a separate package (flash_attn_3).
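
A minimal sketch of enabling the backend when loading a model (the model id is a placeholder; any architecture with Flash Attention 2 support works). Because Flash Attention 2 only accepts fp16/bf16 inputs, a half-precision dtype is passed explicitly:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # placeholder model id

# Flash Attention 2 requires fp16 or bf16 weights, so load directly in bf16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)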

Usage

Use this environment when you need faster attention computation or longer sequence lengths during training or inference. It is particularly beneficial for models with long context windows (4K+ tokens).

System Requirements

Category | Requirement | Notes
OS | Linux | Compilation requires Linux
Hardware | NVIDIA GPU (Ampere+) or AMD GPU (ROCm) | Compute capability >= 8.0 recommended
VRAM | Same as base model | Flash Attention reduces memory usage rather than increasing it
CUDA | 11.8+ | For NVIDIA GPUs
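
The compute capability recommendation above can be checked locally; a minimal sketch using PyTorch's device query (CUDA only):

import torch

if torch.cuda.is_available():
    # get_device_capability returns (major, minor), e.g. (8, 0) for A100
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 0):
        print(f"Compute capability {major}.{minor}: Flash Attention 2 recommended")
    else:
        print(f"Compute capability {major}.{minor}: consider the native SDPA backend")
else:
    print("No CUDA device visible; Flash Attention 2 is unavailable")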

Dependencies

System Packages

  • NVIDIA CUDA Toolkit 11.8+ (for compilation)
  • C++ compiler (gcc/g++)

Python Packages

  • torch >= 2.4.0
  • flash-attn >= 2.1.0 (for CUDA)
  • flash-attn >= 2.0.4 (for ROCm)
  • flash-attn >= 2.3.3 (for MLU)
  • flash-attn-3 (optional, for Flash Attention 3)

Credentials

No credentials required.

Quick Install

# Install flash-attn (may require compilation)
pip install flash-attn --no-build-isolation

# For Flash Attention 3 (optional)
pip install flash-attn-3
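
To confirm the install is usable from Transformers, a short sketch (assuming the flash_attn package built successfully and a supported GPU is visible):

import flash_attn
from transformers.utils import is_flash_attn_2_available

print("flash_attn version:", flash_attn.__version__)
print("usable by Transformers:", is_flash_attn_2_available())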

Code Evidence

Platform-aware version checking from src/transformers/utils/import_utils.py:853-871:

@lru_cache
def is_flash_attn_2_available() -> bool:
    is_available, flash_attn_version = _is_package_available("flash_attn", return_version=True)
    if not is_available or not (is_torch_cuda_available() or is_torch_mlu_available()):
        return False

    import torch
    try:
        if torch.version.cuda:
            return version.parse(flash_attn_version) >= version.parse("2.1.0")
        elif torch.version.hip:
            return version.parse(flash_attn_version) >= version.parse("2.0.4")
        elif is_torch_mlu_available():
            return version.parse(flash_attn_version) >= version.parse("2.3.3")
        else:
            return False
    except packaging.version.InvalidVersion:
        return False

Flash Attention 3 check from src/transformers/utils/import_utils.py:875-876:

@lru_cache
def is_flash_attn_3_available() -> bool:
    return is_torch_cuda_available() and _is_package_available("flash_attn_3")

Deterministic mode support from src/transformers/trainer_utils.py:171:

os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"
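
This variable is set by the Trainer's full-determinism path; a minimal sketch of opting in, assuming enable_full_determinism is importable from the top-level transformers namespace:

import os
from transformers import enable_full_determinism

# Seeds Python/NumPy/PyTorch and, among other settings, exports
# FLASH_ATTENTION_DETERMINISTIC=1 so Flash Attention uses its deterministic backward pass
enable_full_determinism(seed=42)

# Equivalent manual opt-in for the Flash Attention part only
os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"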

Common Errors

Error Message | Cause | Solution
ImportError: flash_attn | flash-attn not installed | pip install flash-attn --no-build-isolation
flash_attn requires CUDA | No CUDA GPU available | Flash Attention requires an NVIDIA or AMD GPU
Compilation error during install | Missing build tools | Install gcc/g++ and CUDA toolkit headers
FlashAttention only supports fp16/bf16 | Float32 input tensors | Cast the model to fp16 or bf16 before using Flash Attention

Compatibility Notes

  • NVIDIA CUDA: Requires flash_attn >= 2.1.0. Best performance on Ampere (A100) and Hopper (H100) GPUs.
  • AMD ROCm: Requires flash_attn >= 2.0.4 from the ROCm-specific fork.
  • Cambricon MLU: Requires flash_attn >= 2.3.3.
  • Intel XPU: Not supported by Flash Attention; use native SDPA instead.
  • Apple MPS: Not supported; use native SDPA instead.
  • CPU: Not supported; Flash Attention is GPU-only.
  • Data Types: Only supports fp16 and bf16 inputs (not float32).
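
Given these platform constraints, a minimal sketch of selecting the attention backend at load time and falling back to native SDPA where Flash Attention 2 is unavailable (the model id is a placeholder):

import torch
from transformers import AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# Prefer Flash Attention 2 when the package and a supported GPU are present;
# otherwise fall back to PyTorch's scaled_dot_product_attention backend.
attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder model id
    torch_dtype=torch.bfloat16 if attn_impl == "flash_attention_2" else torch.float32,
    attn_implementation=attn_impl,
)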
