
Principle:OpenGVLab InternVL Flash Attention Optimization

From Leeroopedia


Knowledge Sources
Domains GPU Optimization, Flash Attention, Memory Efficiency
Last Updated 2026-02-07 14:00 GMT

Overview

Flash Attention avoids materializing the full attention matrix by computing attention in a single fused kernel pass that combines online softmax with tiled block computation. This reduces memory usage from O(N^2) to O(N) in the sequence length N while still producing exact, not approximate, attention.

Description

Standard transformer attention computes the full N x N attention matrix for a sequence of N tokens. This requires O(N^2) memory, which becomes prohibitive for long sequences. Flash Attention solves this through an IO-aware tiling algorithm that:

1. Divides computation into blocks: Query, key, and value tensors are processed in tiles (typically 128x128), keeping only the current block in fast SRAM rather than materializing the full attention matrix in slow HBM (High Bandwidth Memory).

2. Uses online softmax: Rather than computing the softmax denominator over the full sequence at once, the algorithm keeps a running row maximum and a running sum of exponentials (together equivalent to a running log-sum-exp, LSE) and updates them incrementally as each key block is processed. Partial outputs are rescaled whenever the running maximum changes, so the final output is exactly the same as standard attention.

3. Fuses operations into a single kernel: The QK^T multiplication, softmax, and V multiplication are performed in a single GPU kernel launch, avoiding multiple reads/writes to HBM.

4. Supports backward pass: The backward kernels recompute attention weights from saved LSE values rather than storing them, trading compute for memory. An optional sequence-parallel mode using atomic adds improves parallelism for small batch sizes.
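The online softmax in step 2 and the LSE-based recomputation in step 4 can be sketched in plain Python. This is a minimal illustration, not the fused GPU kernel: it handles a single query row with scalar values, and the function names, the block size, and the sample inputs are all assumptions made for the example.

```python
import math

def softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total

def online_softmax_weighted_sum(scores, values, block_size=4):
    """Same result, but visiting the scores one block at a time.

    Only three running quantities are kept between blocks, so the full
    vector of softmax weights is never stored.
    """
    m = -math.inf   # running maximum of the scores seen so far
    l = 0.0         # running sum of exp(score - m)
    acc = 0.0       # running weighted sum of values, scaled by exp(-m)
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        # Rescale the old statistics to the new maximum (0.0 on the first block).
        correction = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * correction + sum(p)
        acc = acc * correction + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    lse = m + math.log(l)   # log-sum-exp of all scores
    return acc / l, lse

scores = [0.5, -1.2, 3.0, 0.1, 2.2, -0.7, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
out, lse = online_softmax_weighted_sum(scores, values, block_size=3)

# A backward pass can rebuild any softmax weight from the saved LSE alone.
recomputed_weights = [math.exp(s - lse) for s in scores]
```

Saving only `lse` per query row is what lets a backward pass recompute the attention weights instead of storing them: each weight is `exp(score - lse)`.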

InternVL employs Flash Attention across multiple model backends:

  • InternLM2 and Phi-3 use the flash-attn library (Flash Attention 2) with variable-length support via flash_attn_varlen_func
  • MPT provides a Triton-compiled Flash Attention implementation with support for attention bias (ALiBi), which the CUDA Flash Attention kernels did not support when that implementation was written
  • All implementations maintain fallback to eager attention when the flash-attn package is not installed
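The fallback behaviour in the last bullet can be sketched as an availability probe. This is an assumption-laden illustration rather than InternVL's actual code: the helper names are hypothetical, and the returned strings are only labels for the two code paths.

```python
import importlib.util

def flash_attn_available() -> bool:
    """Return True if the flash-attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

def pick_attention_path() -> str:
    """Hypothetical helper: prefer Flash Attention, otherwise fall back to eager."""
    return "flash_attention_2" if flash_attn_available() else "eager"

selected = pick_attention_path()
```

Probing with `importlib.util.find_spec` avoids importing the package (and its CUDA extension) just to find out whether it is installed.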

Usage

Apply the Flash Attention optimization whenever sequences are long enough to benefit from the reduced memory usage (typically longer than about 512 tokens). Choose the backend to match the situation: the flash-attn library for standard use, the Triton implementation when attention bias (ALiBi) is needed, or eager attention as a fallback for debugging or unsupported configurations.
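The selection rule above can be written down as a small decision function. This is a sketch under stated assumptions: the function name, its parameters, and the returned labels are hypothetical, and the 512-token threshold is the rule of thumb from this page, not a tuned value.

```python
def choose_attention_backend(seq_len, needs_attention_bias, flash_attn_installed,
                             min_seq_len=512):
    """Hypothetical helper mirroring the usage guidance on this page."""
    if needs_attention_bias:
        # ALiBi-style bias: use the Triton implementation.
        return "triton"
    if flash_attn_installed and seq_len > min_seq_len:
        # Long sequence and the library is present: use flash-attn.
        return "flash-attn"
    # Short sequences, missing package, or debugging: plain eager attention.
    return "eager"

backend = choose_attention_backend(seq_len=4096, needs_attention_bias=False,
                                   flash_attn_installed=True)
```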

Theoretical Basis

Flash Attention is based on the IO-complexity analysis of attention computation:

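  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022) established the foundational algorithm, showing that attention can be computed with O(N^2 d^2 / M) HBM accesses (where d is the head dimension and M is the SRAM size) instead of the Θ(N d + N^2) accesses required by standard attention. For typical values of d and M, d^2 is many times smaller than M, so this is a substantial reduction.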
  • FlashAttention-2 (Dao, 2023) improved on the original with better work partitioning across GPU thread blocks and warps.
  • The Triton implementation allows custom kernel compilation and supports features (attention bias) that the CUDA kernels do not, at the cost of slightly slower backward passes.
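To get a feel for the IO-complexity result, the two bounds from Dao et al. (2022), Θ(N d + N^2) HBM accesses for standard attention and Θ(N^2 d^2 / M) for FlashAttention, can be evaluated for one configuration. This is a rough sketch only: constant factors are dropped, the function names are hypothetical, and the sizes (N = 4096, d = 64, an SRAM holding 25,600 elements, roughly 100 KB of 4-byte values) are assumptions chosen for illustration.

```python
def standard_hbm_accesses(n, d):
    """Asymptotic HBM access count for standard attention (constants dropped)."""
    return n * d + n * n

def flash_hbm_accesses(n, d, sram_elems):
    """Asymptotic HBM access count for FlashAttention (constants dropped)."""
    return n * n * d * d / sram_elems

n, d, sram_elems = 4096, 64, 25_600   # assumed example sizes
standard = standard_hbm_accesses(n, d)
flash = flash_hbm_accesses(n, d, sram_elems)
reduction = standard / flash          # how many times fewer accesses, roughly
```

The gap comes from the d^2 / M factor: as long as d^2 is well below the SRAM capacity M, the tiled algorithm touches HBM far less often than the N^2 term of standard attention would require.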
