
Heuristic: Lucidrains x-transformers Flash Attention Configuration

From Leeroopedia

Knowledge Sources
Domains Optimization, Deep_Learning
Last Updated 2026-02-08 18:00 GMT

Overview

Configuration guide for enabling flash attention and understanding its feature incompatibilities in x-transformers.

Description

Flash attention (`attn_flash=True`) uses PyTorch's native scaled dot-product attention (SDPA) for significantly faster and more memory-efficient attention computation. However, it is incompatible with many advanced attention features that require direct access to the attention matrix. Understanding these trade-offs is essential for choosing the right configuration.

Usage

Use this heuristic when enabling `attn_flash` on a transformer model and hitting assertion errors, or when deciding whether flash attention is viable for a particular feature set.

The Insight (Rule of Thumb)

  • Action: Set `attn_flash=True` on the `Decoder` or `Encoder` attention layers for maximum speed.
  • Value: Boolean flag; no tuning needed.
  • Trade-off: Flash attention is incompatible with these features:
    • T5 relative position bias
    • Dynamic position bias
    • Residual attention (`residual_attn`, `cross_residual_attn`)
    • CoPE (Contextual Positional Encoding)
    • Talking heads (pre/post softmax)
    • Selective attention
    • Sigmoid attention
    • Hard attention
    • Sparse top-k attention
    • Cog signed attention
    • Logit softclamp
    • Head learned sinks
  • Compatible with: Rotary embeddings, ALiBi, standard masking, GQA/MQA, QK normalization.
  • Packed sequences: Require the `flash-attn` package (>= 2.0), SM80+ GPU, and rotary positional embeddings.
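The compatibility rules above can be sketched as a small validation helper. This is an illustrative stand-in, not x-transformers code: the function name `check_flash_config` and the exact keyword names in `FLASH_INCOMPATIBLE` are assumptions chosen to mirror the kind of init-time checks the library performs in `attend.py`.

```python
# Hypothetical helper (not part of x-transformers) mirroring the library's
# init-time incompatibility assertions for flash attention.
FLASH_INCOMPATIBLE = {
    'rel_pos_bias', 'dynamic_pos_bias', 'residual_attn', 'cross_residual_attn',
    'attn_cope', 'attn_pre_talking_heads', 'attn_post_talking_heads',
    'attn_selective', 'attn_sigmoid', 'attn_hard', 'attn_sparse_topk',
    'attn_cog_signed', 'attn_softclamp_logits', 'attn_head_learned_sink',
}

def check_flash_config(**kwargs):
    """Fail at configuration time rather than silently mid-forward."""
    if not kwargs.get('attn_flash'):
        return  # without flash, all features are allowed
    enabled = sorted(k for k, v in kwargs.items() if k in FLASH_INCOMPATIBLE and v)
    assert not enabled, f'flash attention is incompatible with: {enabled}'

# Rotary embeddings are on the compatible list, so this passes silently.
check_flash_config(attn_flash=True, rotary_pos_emb=True)

# Residual attention requires the attention matrix, so this raises.
try:
    check_flash_config(attn_flash=True, residual_attn=True)
except AssertionError as e:
    print(e)
```

Checking at construction time matches the library's design: a bad combination is reported immediately with a readable message instead of producing silently incorrect attention outputs.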

Reasoning

Flash attention fuses the attention computation into a single GPU kernel, avoiding materialization of the full attention matrix. Any feature that needs to read or modify the attention matrix (talking heads, residual attention, softclamp, etc.) cannot work with this fused kernel. The README states: "Avoid flash attention only if you require operating on the attention matrix."

The incompatibility assertions are checked at initialization time in `attend.py` and `x_transformers.py`, providing immediate feedback rather than silent incorrect behavior.
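The fused-versus-unfused distinction can be seen directly in PyTorch. A minimal sketch, assuming `torch` is available: the fused `scaled_dot_product_attention` call never exposes the attention matrix, while the manual path materializes it as the tensor `attn`, which is exactly the hook that talking heads, residual attention, and logit softclamp need.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# (batch, heads, seq_len, dim_head)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# Fused path: one kernel, attention matrix never materialized.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Unfused path: the attention matrix `attn` exists as a real tensor.
# Features that read or modify it can only hook in here.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
causal_mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))
attn = scores.softmax(dim=-1)
unfused = attn @ v

# Same result; only the fused path loses access to `attn`.
assert torch.allclose(fused, unfused, atol=1e-5)
```

Both paths produce the same output, which is why flash attention is a pure speed/memory win whenever no feature needs to touch `attn`.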

Code Evidence

Feature incompatibility assertions from `attend.py:216-290`:

assert not (flash and sigmoid), 'sigmoid attention not available for flash'
assert not (flash and hard), 'hard attention not available for flash'
assert not (flash and is_sparse_topk_attn), 'topk attention not available for flash'
assert not (flash and (pre_talking_heads or post_talking_heads or pre_scale_post_talking_heads)), 'talking heads not compatible with flash attention'
assert not (flash and selective), 'selective attention cannot work on flash attention'
assert not (flash and cog_signed), 'cog attention not available for flash'
assert not (head_learned_sink and flash), f'not supported for flash attention yet'

Residual attention incompatibility from `x_transformers.py:2395`:

assert not (flash_attn and (residual_attn or cross_residual_attn)), 'flash attention is not compatible with residual attention'

Packed sequence constraints from `x_transformers.py:2360`:

assert not flash_pack_seq or rotary_pos_emb, 'block masking only tested for rotary positional embeddings'

KV cache optimization from `attend.py:375-379`:

# in the case of kv cached one token (q_len == 1), just turn off causal masking
# in speculative decoding, this may go up to 5-6, so right aligned causal mask will be needed there
if q_len == 1 and causal:
    causal = False
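The optimization above works because a single cached-decoding query sits at the final position, where a right-aligned causal mask permits every key anyway. A dependency-free sketch (the helpers `softmax` and `attend` are illustrative, not library code) showing the mask is a no-op when `q_len == 1`:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(scores, mask=None):
    # mask[i] == False means key i is disallowed (score driven to -inf)
    if mask is not None:
        scores = [s if keep else float('-inf') for s, keep in zip(scores, mask)]
    return softmax(scores)

# One new query token against 4 cached keys: the query occupies the last
# position, so its causal mask row allows every key.
scores = [0.3, -1.2, 0.8, 0.1]
causal_row = [True, True, True, True]

with_mask = attend(scores, causal_row)
without_mask = attend(scores)
assert with_mask == without_mask  # masking is a no-op when q_len == 1
```

As the source comment notes, speculative decoding breaks this shortcut: with 5-6 queries in flight, the later mask rows are no longer all-true, so a right-aligned causal mask is required.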
