Environment:Mit han lab Llm awq Flash Attention Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization |
| Last Updated | 2026-02-15 01:00 GMT |
Overview
Optional Flash Attention 2 dependency (v2.5.8) for efficient long-sequence attention in TinyChat inference.
Description
Flash Attention 2 is an optional but strongly recommended dependency that provides memory-efficient, IO-aware attention computation. AWQ TinyChat uses it in two scenarios: (1) prefilling with batched queries, and (2) decoding with long sequences (> 8192 tokens), where the built-in fused FasterTransformer kernel would run out of memory. `flash_attn_func` is imported directly in the fused attention module and in the LLaMA and Qwen2 model implementations.
Usage
Use this environment when deploying models via TinyChat with long sequences (> 8192 tokens) or when running InternVL3 multimodal models. Without Flash Attention, only the FasterTransformer fused kernel is available, and it is limited to sequences of at most 8192 tokens.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU (Ampere or newer recommended) | Flash Attention 2 is optimized for sm_80+ |
| CUDA | CUDA 11.6+ | Must match PyTorch CUDA version |
| PyTorch | 2.3.0 | Wheel filename must match installed PyTorch version |
Dependencies
Python Packages
- `flash-attn` == 2.5.8 (recommended version from README)
Credentials
No credentials required for this environment.
Quick Install
```bash
# Install Flash Attention at the recommended version (must use --no-build-isolation)
pip install flash-attn==2.5.8 --no-build-isolation
```
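After installing, it can be useful to confirm that the installed version matches the recommended one. The following is a minimal sketch using only the standard library. `check_flash_attn` and `EXPECTED_VERSION` are hypothetical names introduced here for illustration, and the distribution name `flash-attn` is taken from the pip command above.

```python
from importlib import metadata

EXPECTED_VERSION = "2.5.8"  # version recommended on this page


def check_flash_attn(expected=EXPECTED_VERSION, dist_name="flash-attn"):
    """Return (installed_version_or_None, matches_expected).

    Hypothetical helper: it reads package metadata only and never
    imports flash_attn, so it works without a GPU.
    """
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None, False
    # Assumption: a build may report a local suffix after "+";
    # only the public part of the version is compared.
    public = installed.split("+", 1)[0]
    return installed, public == expected


if __name__ == "__main__":
    installed, ok = check_flash_attn()
    if installed is None:
        print("flash-attn is not installed")
    else:
        print(f"flash-attn {installed} found; matches {EXPECTED_VERSION}: {ok}")
```

A metadata check like this only shows that the package is present. It does not prove that the compiled extension loads against the installed PyTorch build.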
Code Evidence
Optional import with fallback from `tinychat/models/internvl3.py:34-39`:
```python
try:
    import flash_attn
    has_flash_attn = True
except ImportError:
    print('FlashAttention2 is not installed.')
    has_flash_attn = False
```
Direct usage in fused attention from `tinychat/modules/fused_attn.py:481`:
```python
output = flash_attn_func(q=xq, k=keys, v=values, causal=True)
```
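To make clear what a causal attention call like the one above computes, here is a naive single-head reference in pure Python. It is a sketch of the math only, not the library's implementation. The real `flash_attn_func` operates on GPU tensors of shape `(batch, seqlen, nheads, headdim)` and never materializes the full score matrix. The function name `causal_attention_reference` is hypothetical. The sketch assumes queries and keys have the same length and uses the default scale of `1 / sqrt(head_dim)`.

```python
import math


def causal_attention_reference(q, k, v):
    """Naive single-head causal attention over lists of float vectors.

    q, k, v: lists of length seqlen; each item is a list of floats.
    Assumes len(q) == len(k), so position i attends to positions 0..i.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i, qi in enumerate(q):
        # Scaled dot products against keys at positions <= i (the causal mask).
        scores = [
            scale * sum(a * b for a, b in zip(qi, k[j]))
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return out
```

Because the first position can only see itself, its output is always the first value vector, which gives a quick sanity check when comparing against a GPU kernel on small inputs.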
Kernel selection threshold from `tinychat/modules/fused_attn.py:358-390`:
```python
# For short sequence, we use fused kernel to accelerate decoding.
if self.kv_max_seq_len <= 8192:
    self.forward = self.short_forward
# For long sequence, we use flash attention for both prefilling
# and decoding to avoid OOM.
else:
    ...
```
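The threshold logic above can be sketched as a standalone function that also accounts for whether Flash Attention is available. This is an illustration only: `select_attention_path`, the returned labels, and the explicit availability check are hypothetical and are not taken from `fused_attn.py`, whose long-sequence branch is elided in the excerpt.

```python
FUSED_KERNEL_MAX_SEQ_LEN = 8192  # threshold quoted in the excerpt above


def select_attention_path(kv_max_seq_len, has_flash_attn):
    """Return a label for the attention path to use (hypothetical helper).

    Short KV caches use the fused decoding kernel. Longer ones need
    Flash Attention, so a missing install is reported up front here
    rather than surfacing later as an out-of-memory error.
    """
    if kv_max_seq_len <= FUSED_KERNEL_MAX_SEQ_LEN:
        return "fused_short"
    if not has_flash_attn:
        raise RuntimeError(
            f"kv_max_seq_len={kv_max_seq_len} exceeds "
            f"{FUSED_KERNEL_MAX_SEQ_LEN}; install flash-attn for long sequences"
        )
    return "flash_attn_long"
```

Note that the decision depends on the configured maximum KV length rather than on the length of any individual request, which matches the excerpt's use of `kv_max_seq_len`.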
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'flash_attn'` | Flash Attention not installed | `pip install flash-attn==2.5.8 --no-build-isolation` |
| `FlashAttention2 is not installed.` | Missing optional dependency | Install flash-attn; InternVL3 still runs without it, but more slowly |
| Build failure with flash-attn | PyTorch/CUDA version mismatch in wheel | Check that wheel filename matches PyTorch version; try both `cxx11abiTRUE` and `cxx11abiFALSE` wheels |
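When choosing a prebuilt wheel, the filename can be checked against the local environment before downloading. The sketch below assumes wheel names follow a pattern like `flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl`. That pattern, along with the helper names `parse_wheel_name` and `wheel_matches`, is an assumption made for illustration and should be compared against the actual release assets.

```python
import re

# Assumed naming pattern (verify against the actual release assets):
# flash_attn-<ver>+cu<cuda>torch<major.minor>cxx11abi<TRUE|FALSE>-<pytag>-...
WHEEL_RE = re.compile(
    r"flash_attn-(?P<version>[\d.]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<py>cp\d+)-"
)


def parse_wheel_name(filename):
    """Split a wheel filename into its parts, or return None if it does not match."""
    m = WHEEL_RE.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    info["abi"] = info["abi"] == "TRUE"  # convert the flag to a bool
    return info


def wheel_matches(filename, torch_version, cxx11_abi, py_tag):
    """Check a wheel name against the local PyTorch version, ABI flag, and Python tag."""
    info = parse_wheel_name(filename)
    if info is None:
        return False
    # Compare major.minor only: "2.3.0+cu121" -> "2.3".
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    return (
        info["torch"] == torch_mm
        and info["abi"] == cxx11_abi
        and info["py"] == py_tag
    )
```

In a real environment the arguments would come from `torch.__version__`, `torch.compiled_with_cxx11_abi()`, and the interpreter version (for example `f"cp{sys.version_info.major}{sys.version_info.minor}"`). Passing them in explicitly keeps the check usable on machines without PyTorch installed.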
Compatibility Notes
- Optional Dependency: TinyChat works without Flash Attention for sequences <= 8192 tokens using the FasterTransformer fused kernel
- Required for Long Sequences: Sequences > 8192 tokens require Flash Attention to avoid CUDA OOM
- InternVL3: Checks for Flash Attention availability; prints warning if missing but continues
- Build Isolation: Must use `--no-build-isolation` flag during pip install