Environment:Mit han lab Llm awq Flash Attention Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Optimization
Last Updated 2026-02-15 01:00 GMT

Overview

Optional Flash Attention 2 dependency (v2.5.8) for efficient long-sequence attention in TinyChat inference.

Description

Flash Attention 2 is an optional but strongly recommended dependency that provides memory-efficient, IO-aware attention computation. AWQ TinyChat uses it in two scenarios: (1) prefilling with batched queries, and (2) decoding with long sequences (> 8192 tokens), where the built-in fused FasterTransformer kernel would run out of memory. `flash_attn_func` is imported directly in the fused attention module and in the LLaMA and Qwen2 model implementations.

Usage

Use this environment when deploying models via TinyChat with long sequences (> 8192 tokens) or when running InternVL3 multimodal models. Without Flash Attention, TinyChat falls back to the FasterTransformer fused kernel, which is limited to sequences <= 8192 tokens.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU (Ampere or newer recommended) | Flash Attention 2 is optimized for sm_80+ |
| CUDA | CUDA 11.6+ | Must match the PyTorch CUDA version |
| PyTorch | 2.3.0 | Wheel filename must match the installed PyTorch version |

Dependencies

Python Packages

  • `flash-attn` == 2.5.8 (recommended version from README)

Credentials

No credentials required for this environment.

Quick Install

# Install the recommended Flash Attention release (must use --no-build-isolation)
pip install flash-attn==2.5.8 --no-build-isolation
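
After installing, it can be useful to confirm that the installed release matches the recommended one. The following is a minimal sketch that uses only the Python standard library. The helper names `installed_version` and `check_flash_attn` are hypothetical, and the distribution name `flash-attn` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional

RECOMMENDED = "2.5.8"  # version recommended on this page


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_flash_attn(expected: str = RECOMMENDED,
                     dist_name: str = "flash-attn") -> str:
    """Return a short status message (hypothetical helper, not part of TinyChat)."""
    found = installed_version(dist_name)
    if found is None:
        return f"{dist_name} is not installed"
    # Drop any local version suffix such as "+cu122..." before comparing.
    base = found.split("+")[0]
    if base != expected:
        return f"{dist_name} {found} is installed, but {expected} is recommended"
    return f"{dist_name} {found} matches the recommended version"


if __name__ == "__main__":
    print(check_flash_attn())
```

The check reads package metadata only, so it does not import `flash_attn` and does not need a GPU.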

Code Evidence

Optional import with fallback from `tinychat/models/internvl3.py:34-39`:

try:
    import flash_attn
    has_flash_attn = True
except ImportError:
    print('FlashAttention2 is not installed.')
    has_flash_attn = False

Direct usage in fused attention from `tinychat/modules/fused_attn.py:481`:

output = flash_attn_func(q=xq, k=keys, v=values, causal=True)

Kernel selection threshold from `tinychat/modules/fused_attn.py:358-390`:

# For short sequence, we use fused kernel to accelerate decoding.
if self.kv_max_seq_len <= 8192:
    self.forward = self.short_forward
# For long sequence, we use flash attention for both prefilling
# and decoding to avoid OOM.
else:
    ...
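
The selection rule described on this page can be summarized as a small decision function. This is a sketch based on the Description section and the threshold quoted above, not on the elided branch of the source. The function name, the `is_prefill` parameter, and the returned labels are all hypothetical.

```python
FUSED_KERNEL_MAX_SEQ_LEN = 8192  # threshold quoted from fused_attn.py above


def choose_attention_path(kv_max_seq_len: int, is_prefill: bool) -> str:
    """Return which attention path this page describes for a given call.

    Assumption: prefilling uses Flash Attention at any length, and decoding
    uses the fused kernel only when the KV cache limit is at most 8192.
    """
    if kv_max_seq_len > FUSED_KERNEL_MAX_SEQ_LEN:
        # Long sequences: Flash Attention for both prefilling and decoding.
        return "flash_attn"
    # Short sequences: fused kernel for decoding, Flash Attention for prefill.
    return "flash_attn" if is_prefill else "fused_kernel"


if __name__ == "__main__":
    for limit in (4096, 8192, 16384):
        for prefill in (True, False):
            print(limit, prefill, choose_attention_path(limit, prefill))
```

Note that the threshold is compared against the configured maximum KV length, not against the length of the current request.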

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'flash_attn'` | Flash Attention is not installed | `pip install flash-attn --no-build-isolation` |
| `FlashAttention2 is not installed.` | Missing optional dependency | Install flash-attn; InternVL3 will still work, but more slowly |
| Build failure with flash-attn | PyTorch/CUDA version mismatch in the wheel | Check that the wheel filename matches the PyTorch version; try both the `cxx11abiTRUE` and `cxx11abiFALSE` wheels |
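
To make the wheel check in the last row concrete, the sketch below parses a wheel filename and compares it with a PyTorch version string. The filename layout (`+cu<CUDA>torch<X.Y>cxx11abi<TRUE|FALSE>`) is an assumption based on how prebuilt flash-attn wheels are typically named, the example filename is illustrative, and both helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed layout of a prebuilt flash-attn wheel filename.
WHEEL_RE = re.compile(
    r"^flash_attn-(?P<version>[\d.]+)"
    r"\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)"
    r"-(?P<python>cp\d+)-cp\d+-(?P<platform>.+)\.whl$"
)


def parse_wheel_name(filename: str) -> Optional[dict]:
    """Split a wheel filename into its tags, or return None if it does not match."""
    match = WHEEL_RE.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    info["cxx11abi"] = info.pop("abi") == "TRUE"
    return info


def wheel_matches(filename: str, torch_version: str, cxx11_abi: bool) -> bool:
    """Check the wheel's torch major.minor and C++11 ABI flag against a PyTorch build."""
    info = parse_wheel_name(filename)
    if info is None:
        return False
    # "2.3.0+cu121" -> "2.3"
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    return info["torch"] == major_minor and info["cxx11abi"] == cxx11_abi


if __name__ == "__main__":
    name = "flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
    print(parse_wheel_name(name))
    print(wheel_matches(name, "2.3.0+cu121", cxx11_abi=False))
```

In a real environment, the two comparison values would come from the installed PyTorch build, for example `torch.__version__` and `torch.compiled_with_cxx11_abi()`.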

Compatibility Notes

  • Optional Dependency: TinyChat works without Flash Attention for sequences <= 8192 tokens using the FasterTransformer fused kernel
  • Required for Long Sequences: Sequences > 8192 tokens require Flash Attention to avoid CUDA OOM
  • InternVL3: Checks for Flash Attention availability; prints warning if missing but continues
  • Build Isolation: Must use `--no-build-isolation` flag during pip install
