Environment: OpenGVLab InternVL Flash Attention 2
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Flash Attention 2 is an attention-acceleration library that requires an NVIDIA GPU with compute capability >= 8.0 (Ampere, Ada, or Hopper) and provides efficient attention computation for InternVL training and inference.
Description
Flash Attention 2 provides memory-efficient and IO-aware exact attention computation. InternVL uses it in three locations: the InternViT vision encoder (via `flash_attn_varlen_qkvpacked_func`), the InternLM2 language model (via `flash_attn_func` and `flash_attn_varlen_func`), and the LLaMA/Phi-3/Qwen2 language models (via `flash_attn_varlen_kvpacked_func`). The library is optional but strongly recommended: the code auto-detects its availability and falls back to eager attention when not installed.
Usage
Use this environment when training or fine-tuning InternVL models with GPU acceleration. Flash Attention 2 is automatically enabled when available and provides significant speedup and memory savings. It is required for packed sequence training (varlen attention) and recommended for all training workflows.
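The auto-detection described above can be mirrored with a minimal standalone probe. This is a sketch, not code from the repository: it imports `flash_attn` if available and otherwise falls back to the `eager` implementation string, matching the selection logic shown in the Code Evidence section.

```python
# Probe for flash-attn and fall back to eager attention when the
# import fails, mirroring the repository's auto-detection pattern.
try:
    import flash_attn  # noqa: F401
    has_flash_attn = True
except ImportError:
    has_flash_attn = False

attn_implementation = 'flash_attention_2' if has_flash_attn else 'eager'
print(attn_implementation)
```

On a machine without flash-attn installed this prints `eager`, and training proceeds with the slower fallback path.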
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with compute capability >= 8.0 | A100, H100, RTX 3090/4090 (Ampere/Hopper/Ada) |
| CUDA | CUDA 11.6+ | Required for Flash Attention compilation |
| OS | Linux | Windows not supported for flash-attn |
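The hardware requirement can be checked up front before attempting a build. The helper below is an illustrative sketch (the function name is not from the repository); with a GPU present you would pass it the tuple returned by `torch.cuda.get_device_capability()`.

```python
def supports_flash_attn2(capability):
    """Return True if a (major, minor) CUDA compute capability
    meets the >= 8.0 requirement for Flash Attention 2."""
    return capability >= (8, 0)

# The literals below stand in for common devices.
assert supports_flash_attn2((8, 0))      # A100 (Ampere)
assert supports_flash_attn2((8, 9))      # RTX 4090 (Ada)
assert supports_flash_attn2((9, 0))      # H100 (Hopper)
assert not supports_flash_attn2((7, 5))  # T4 (Turing): too old
```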
Dependencies
Python Packages
- `flash-attn` >= 2.1.0 (version 2.8.2 used in GPT-OSS variant)
- `torch` >= 2.0 (prerequisite)
Submodules Used
- `flash_attn.flash_attn_interface.flash_attn_func`
- `flash_attn.flash_attn_interface.flash_attn_varlen_func`
- `flash_attn.flash_attn_interface.flash_attn_varlen_qkvpacked_func`
- `flash_attn.flash_attn_interface.flash_attn_varlen_kvpacked_func`
- `flash_attn.bert_padding.pad_input`
- `flash_attn.bert_padding.unpad_input`
Credentials
No credentials required.
Quick Install
# Install Flash Attention 2 (requires CUDA toolkit and compatible GPU)
pip install flash-attn --no-build-isolation
Code Evidence
Flash Attention availability check from `modeling_intern_vit.py:23-30`:
try:
    from flash_attn.bert_padding import pad_input, unpad_input
    from flash_attn.flash_attn_interface import \
        flash_attn_varlen_qkvpacked_func
    has_flash_attn = True
except:
    print('FlashAttention2 is not installed.')
    has_flash_attn = False
Auto-selection logic from `modeling_internvl_chat.py:62-64`:
use_flash_attn = use_flash_attn if has_flash_attn else False
config.vision_config.use_flash_attn = True if use_flash_attn else False
config.llm_config.attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
GPU capability check from `llama2_flash_attn_monkey_patch.py:132-137`:
cuda_major, cuda_minor = torch.cuda.get_device_capability()
if cuda_major < 8:
    warnings.warn(
        'Flash attention is only supported on A100 or H100 GPU during '
        'training due to head dim > 64 backward.'
    )
Version check from `llama2_flash_attn_monkey_patch.py:9,70`:
from flash_attn import __version__ as flash_attn_version
# Line 70:
flash_attn_version >= '2.1.0'
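Note that the check above compares version strings lexicographically, which happens to work for the versions involved but misorders multi-digit components (e.g. `'2.10.0' < '2.2.0'` as strings). A safer sketch parses the numeric components first; the helper name is an assumption, and versions with pre-release suffixes would need `packaging.version` instead.

```python
def parse_version(v):
    """Split a dotted version string into a tuple of ints so that
    comparisons order numerically rather than lexicographically."""
    return tuple(int(part) for part in v.split('.')[:3])

# Lexicographic string comparison gets multi-digit components wrong:
assert '2.10.0' < '2.2.0'                                 # wrong as strings
assert parse_version('2.10.0') > parse_version('2.2.0')   # correct as tuples
assert parse_version('2.8.2') >= parse_version('2.1.0')
```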
InternLM2 import guard from `modeling_internlm2.py:51-62,65-79`:
try:
    from flash_attn import flash_attn_func as _flash_attn_func
    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
    has_flash_attn = True
except:
    has_flash_attn = False

def _import_flash_attn():
    global flash_attn_func, flash_attn_varlen_func
    try:
        from flash_attn import flash_attn_func, flash_attn_varlen_func
    except ImportError:
        raise ImportError('flash_attn is not installed.')
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FlashAttention2 is not installed.` | `flash-attn` package missing | `pip install flash-attn --no-build-isolation` |
| `ImportError: flash_attn is not installed.` | InternLM2 attention requires flash-attn | Install flash-attn or switch to eager attention |
| `Flash attention is only supported on A100 or H100 GPU` | GPU compute capability < 8.0 | Use A100/H100/RTX 30xx+ GPU, or disable flash attention |
| Build errors during `pip install flash-attn` | CUDA toolkit version mismatch | Ensure CUDA toolkit matches PyTorch CUDA version |
Compatibility Notes
- Fallback: When Flash Attention is not installed, all models fall back to `eager` attention. Training works but is slower and uses more memory.
- RoCm (AMD GPUs): Multiple packed training patches contain a TODO: "Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1." RoCm support is experimental.
- Architecture-specific config: InternLM2 uses `attn_implementation` while LLaMA uses `_attn_implementation` (with underscore prefix) for flash attention configuration.
- Packed Training: Flash Attention varlen functions are required for packed sequence training. Without flash-attn, packed training will not function correctly.
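The varlen kernels used for packed training take concatenated sequences plus a cumulative-boundary tensor (`cu_seqlens`) instead of a padded batch. The helper below is a plain-Python sketch (the function name is assumed) of how those boundaries are derived from per-sample lengths; in practice they are passed to `flash_attn_varlen_func` as an int32 tensor along with `max_seqlen = max(lengths)`.

```python
def build_cu_seqlens(lengths):
    """Cumulative sequence boundaries for packed (varlen) attention.

    For lengths [3, 5, 2] the packed batch holds 10 tokens and the
    boundaries are [0, 3, 8, 10]: sample i occupies tokens
    cu[i]:cu[i+1] of the concatenated batch.
    """
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)
    return cu

assert build_cu_seqlens([3, 5, 2]) == [0, 3, 8, 10]
```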