Environment:Mit han lab Llm awq Flash Attention Environment

Knowledge Sources	llm-awq flash-attention
Domains	Infrastructure, Optimization
Last Updated	2026-02-15 01:00 GMT

Overview

Optional Flash Attention 2 dependency (v2.5.8) for efficient long-sequence attention in TinyChat inference.

Description

Flash Attention 2 is an optional but strongly recommended dependency that provides memory-efficient, IO-aware attention computation. AWQ TinyChat uses it in two scenarios: (1) prefilling with batched queries, and (2) decoding with long sequences (> 8192 tokens), where the built-in fused FasterTransformer kernel would run out of memory. `flash_attn_func` is imported directly in the fused attention module and in the LLaMA and Qwen2 model implementations.

Usage

Use this environment when deploying models via TinyChat with long sequences (> 8192 tokens) or when running InternVL3 multimodal models. Without Flash Attention, TinyChat falls back to the FasterTransformer fused kernel, which is limited to sequences <= 8192 tokens.

System Requirements

Category	Requirement	Notes
Hardware	NVIDIA GPU (Ampere or newer recommended)	Flash Attention 2 is optimized for sm_80+
CUDA	CUDA 11.6+	Must match PyTorch CUDA version
PyTorch	2.3.0	Wheel filename must match installed PyTorch version

Dependencies

Python Packages

`flash-attn` == 2.5.8 (recommended version from README)

Credentials

No credentials required for this environment.

Quick Install

# Install the recommended Flash Attention version (must use --no-build-isolation)
pip install flash-attn==2.5.8 --no-build-isolation
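
After installing, it can help to confirm which version actually ended up in the environment. The following is a minimal sketch using only the Python standard library. The helper names `installed_flash_attn_version` and `check_flash_attn` are hypothetical and not part of llm-awq or flash-attn, and the handling of a `+...` local version suffix is an assumption about how some wheels report their version.

```python
from importlib import metadata
from typing import Optional

RECOMMENDED_VERSION = "2.5.8"  # version recommended on this page


def installed_flash_attn_version() -> Optional[str]:
    """Return the installed flash-attn version, or None if it is not installed."""
    # Try both spellings of the distribution name (hypothetical safeguard).
    for name in ("flash-attn", "flash_attn"):
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def check_flash_attn(recommended: str = RECOMMENDED_VERSION) -> str:
    """Classify the local install as 'missing', 'ok', or 'mismatch'."""
    version = installed_flash_attn_version()
    if version is None:
        return "missing"
    # Assumption: a local suffix such as '+cu122...' may follow the public version.
    public_version = version.split("+", 1)[0]
    return "ok" if public_version == recommended else "mismatch"


if __name__ == "__main__":
    print(f"flash-attn status: {check_flash_attn()}")
```

A result of `mismatch` does not necessarily mean TinyChat will fail; it only signals that the installed version differs from the one recommended here.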

Code Evidence

Optional import with fallback from `tinychat/models/internvl3.py:34-39`:

try:
    import flash_attn
    has_flash_attn = True
except ImportError:
    print('FlashAttention2 is not installed.')
    has_flash_attn = False

Direct usage in fused attention from `tinychat/modules/fused_attn.py:481`:

output = flash_attn_func(q=xq, k=keys, v=values, causal=True)

Kernel selection threshold from `tinychat/modules/fused_attn.py:358-390`:

# For short sequence, we use fused kernel to accelerate decoding.
if self.kv_max_seq_len <= 8192:
    self.forward = self.short_forward
# For long sequence, we use flash attention for both prefilling
# and decoding to avoid OOM.
else:
    ...
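
The threshold logic above can be mirrored in a small pre-flight check that runs before a model is loaded. This is an illustrative sketch only: `select_attention_path` and `FUSED_KERNEL_MAX_SEQ_LEN` are hypothetical names, the raised error is a design choice of this sketch rather than TinyChat behavior, and only the 8192-token threshold is taken from the excerpt.

```python
FUSED_KERNEL_MAX_SEQ_LEN = 8192  # threshold quoted from the excerpt above


def select_attention_path(kv_max_seq_len: int, has_flash_attn: bool) -> str:
    """Return 'fused' or 'flash' for a given maximum KV-cache length.

    Hypothetical helper: it reproduces only the length comparison and
    does not touch any TinyChat module.
    """
    if kv_max_seq_len <= FUSED_KERNEL_MAX_SEQ_LEN:
        # Short sequences can decode with the fused kernel.
        return "fused"
    if not has_flash_attn:
        # Fail early instead of risking an out-of-memory error at run time.
        raise RuntimeError(
            f"kv_max_seq_len={kv_max_seq_len} exceeds "
            f"{FUSED_KERNEL_MAX_SEQ_LEN}; install flash-attn first"
        )
    return "flash"


if __name__ == "__main__":
    for length in (4096, 8192, 16384):
        print(length, select_attention_path(length, has_flash_attn=True))
```

Note that the boundary is inclusive: a maximum length of exactly 8192 still selects the fused path.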

Common Errors

Error Message	Cause	Solution
`ImportError: No module named 'flash_attn'`	Flash Attention not installed	`pip install flash-attn --no-build-isolation`
`FlashAttention2 is not installed.`	Missing optional dependency	Install flash-attn; InternVL3 will still work, but more slowly
Build failure with flash-attn	PyTorch/CUDA version mismatch in wheel	Check that wheel filename matches PyTorch version; try both `cxx11abiTRUE` and `cxx11abiFALSE` wheels
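
When choosing a prebuilt wheel, the relevant facts are the installed PyTorch version, the CUDA version PyTorch was built against, and the C++11 ABI setting. The sketch below collects them in one place. `describe_torch_build` is a hypothetical helper name, and it returns `None` when PyTorch itself is not installed.

```python
from typing import Dict, Optional


def describe_torch_build() -> Optional[Dict[str, str]]:
    """Report the PyTorch build details that a flash-attn wheel must match."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so there is nothing to match against yet.
        return None
    return {
        "torch": torch.__version__,
        # torch.version.cuda is None for CPU-only builds, hence str().
        "cuda": str(torch.version.cuda),
        "cxx11abi": "TRUE" if torch.compiled_with_cxx11_abi() else "FALSE",
    }


if __name__ == "__main__":
    info = describe_torch_build()
    if info is None:
        print("PyTorch is not installed.")
    else:
        for key, value in info.items():
            print(f"{key}: {value}")
```

The `cxx11abi` value indicates which of the `cxx11abiTRUE` and `cxx11abiFALSE` wheels is the likelier match for the local build.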

Compatibility Notes

Optional Dependency: TinyChat works without Flash Attention for sequences <= 8192 tokens using the FasterTransformer fused kernel
Required for Long Sequences: Sequences > 8192 tokens require Flash Attention to avoid CUDA OOM
InternVL3: Checks for Flash Attention availability; prints warning if missing but continues
Build Isolation: Must use `--no-build-isolation` flag during pip install
