Environment: OpenGVLab InternVL Flash Attention 2
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Flash Attention 2 is an attention-acceleration library that requires an NVIDIA GPU with compute capability >= 8.0 (Ampere, Ada, or Hopper) and provides efficient attention computation for InternVL training and inference.
Description
Flash Attention 2 provides memory-efficient and IO-aware exact attention computation. InternVL uses it in three locations: the InternViT vision encoder (via `flash_attn_varlen_qkvpacked_func`), the InternLM2 language model (via `flash_attn_func` and `flash_attn_varlen_func`), and the LLaMA/Phi-3/Qwen2 language models (via `flash_attn_varlen_kvpacked_func`). The library is optional but strongly recommended: the code auto-detects its availability and falls back to eager attention when not installed.
Usage
Use this environment when training or fine-tuning InternVL models with GPU acceleration. Flash Attention 2 is automatically enabled when available and provides significant speedup and memory savings. It is required for packed sequence training (varlen attention) and recommended for all training workflows.
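The auto-detection described above can be mirrored with a minimal standalone probe. This is a sketch, not code from the repository: it imports `flash_attn` if available and otherwise falls back to the `eager` implementation string, matching the selection logic shown in the Code Evidence section.

```python
# Probe for flash-attn and fall back to eager attention when the
# import fails, mirroring the repository's auto-detection pattern.
try:
    import flash_attn  # noqa: F401
    has_flash_attn = True
except ImportError:
    has_flash_attn = False

attn_implementation = 'flash_attention_2' if has_flash_attn else 'eager'
print(attn_implementation)
```

On a machine without flash-attn installed this prints `eager`, and training proceeds with the slower fallback path.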
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with compute capability >= 8.0 | A100, H100, RTX 3090/4090 (Ampere/Hopper/Ada) |
| CUDA | CUDA 11.6+ | Required for Flash Attention compilation |
| OS | Linux | Windows not supported for flash-attn |
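The hardware requirement can be checked up front before attempting a build. The helper below is an illustrative sketch (the function name is not from the repository); with a GPU present you would pass it the tuple returned by `torch.cuda.get_device_capability()`.

```python
def supports_flash_attn2(capability):
    """Return True if a (major, minor) CUDA compute capability
    meets the >= 8.0 requirement for Flash Attention 2."""
    return capability >= (8, 0)

# The literals below stand in for common devices.
assert supports_flash_attn2((8, 0))      # A100 (Ampere)
assert supports_flash_attn2((8, 9))      # RTX 4090 (Ada)
assert supports_flash_attn2((9, 0))      # H100 (Hopper)
assert not supports_flash_attn2((7, 5))  # T4 (Turing): too old
```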
Dependencies
Python Packages
- `flash-attn` >= 2.1.0 (version 2.8.2 used in GPT-OSS variant)
- `torch` >= 2.0 (prerequisite)
Submodules Used
- `flash_attn.flash_attn_interface.flash_attn_func`
- `flash_attn.flash_attn_interface.flash_attn_varlen_func`
- `flash_attn.flash_attn_interface.flash_attn_varlen_qkvpacked_func`
- `flash_attn.flash_attn_interface.flash_attn_varlen_kvpacked_func`
- `flash_attn.bert_padding.pad_input`
- `flash_attn.bert_padding.unpad_input`
Credentials
No credentials required.
Quick Install
# Install Flash Attention 2 (requires CUDA toolkit and compatible GPU)
pip install flash-attn --no-build-isolation
Code Evidence
Flash Attention availability check from `modeling_intern_vit.py:23-30`:
try:
    from flash_attn.bert_padding import pad_input, unpad_input
    from flash_attn.flash_attn_interface import \
        flash_attn_varlen_qkvpacked_func
    has_flash_attn = True
except:
    print('FlashAttention2 is not installed.')
    has_flash_attn = False
Auto-selection logic from `modeling_internvl_chat.py:62-64`:
use_flash_attn = use_flash_attn if has_flash_attn else False
config.vision_config.use_flash_attn = True if use_flash_attn else False
config.llm_config.attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
GPU capability check from `llama2_flash_attn_monkey_patch.py:132-137`:
cuda_major, cuda_minor = torch.cuda.get_device_capability()
if cuda_major < 8:
    warnings.warn(
        'Flash attention is only supported on A100 or H100 GPU during '
        'training due to head dim > 64 backward.'
    )
Version check from `llama2_flash_attn_monkey_patch.py:9,70`:
from flash_attn import __version__ as flash_attn_version
# Line 70:
flash_attn_version >= '2.1.0'
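Note that the check above compares version strings lexicographically, which happens to work for the versions involved but misorders multi-digit components (e.g. `'2.10.0' < '2.2.0'` as strings). A safer sketch parses the numeric components first; the helper name is an assumption, and versions with pre-release suffixes would need `packaging.version` instead.

```python
def parse_version(v):
    """Split a dotted version string into a tuple of ints so that
    comparisons order numerically rather than lexicographically."""
    return tuple(int(part) for part in v.split('.')[:3])

# Lexicographic string comparison gets multi-digit components wrong:
assert '2.10.0' < '2.2.0'                                 # wrong as strings
assert parse_version('2.10.0') > parse_version('2.2.0')   # correct as tuples
assert parse_version('2.8.2') >= parse_version('2.1.0')
```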
InternLM2 import guard from `modeling_internlm2.py:51-62,65-79`:
try:
    from flash_attn import flash_attn_func as _flash_attn_func
    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
    has_flash_attn = True
except:
    has_flash_attn = False

def _import_flash_attn():
    global flash_attn_func, flash_attn_varlen_func
    try:
        from flash_attn import flash_attn_func, flash_attn_varlen_func
    except ImportError:
        raise ImportError('flash_attn is not installed.')
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FlashAttention2 is not installed.` | `flash-attn` package missing | `pip install flash-attn --no-build-isolation` |
| `ImportError: flash_attn is not installed.` | InternLM2 attention requires flash-attn | Install flash-attn or switch to eager attention |
| `Flash attention is only supported on A100 or H100 GPU` | GPU compute capability < 8.0 | Use A100/H100/RTX 30xx+ GPU, or disable flash attention |
| Build errors during `pip install flash-attn` | CUDA toolkit version mismatch | Ensure CUDA toolkit matches PyTorch CUDA version |
Compatibility Notes
- Fallback: When Flash Attention is not installed, all models fall back to `eager` attention. Training works but is slower and uses more memory.
- RoCm (AMD GPUs): Multiple packed training patches contain a TODO: "Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1." RoCm support is experimental.
- Architecture-specific config: InternLM2 uses `attn_implementation` while LLaMA uses `_attn_implementation` (with underscore prefix) for flash attention configuration.
- Packed Training: Flash Attention varlen functions are required for packed sequence training. Without flash-attn, packed training will not function correctly.
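The varlen kernels used for packed training take concatenated sequences plus a cumulative-boundary tensor (`cu_seqlens`) instead of a padded batch. The helper below is a plain-Python sketch (the function name is assumed) of how those boundaries are derived from per-sample lengths; in practice they are passed to `flash_attn_varlen_func` as an int32 tensor along with `max_seqlen = max(lengths)`.

```python
def build_cu_seqlens(lengths):
    """Cumulative sequence boundaries for packed (varlen) attention.

    For lengths [3, 5, 2] the packed batch holds 10 tokens and the
    boundaries are [0, 3, 8, 10]: sample i occupies tokens
    cu[i]:cu[i+1] of the concatenated batch.
    """
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)
    return cu

assert build_cu_seqlens([3, 5, 2]) == [0, 3, 8, 10]
```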