Principle:Sail sg LongSpec Tree Attention Verification

Knowledge Sources	LongSpec: Long-Context Speculative Decoding Flash-Decoding for Long-Context LongSpec
Domains	Attention_Mechanisms, GPU_Kernels, Speculative_Decoding
Last Updated	2026-02-14 05:00 GMT

Overview

Specialized attention mechanism using a Triton GPU kernel to compute masked attention over tree-structured candidate tokens during speculative decoding verification.

Description

Tree Attention Verification is the critical bottleneck in tree speculative decoding: the target LLM must evaluate all candidate tokens in the tree simultaneously. Standard attention mechanisms (causal or Flash Attention) cannot handle the non-causal tree-structured masks required for this verification.

The Tree Attention kernel solves this by:

Implementing tree-masked attention where the attention mask encodes parent-child relationships in the candidate tree
Computing log-sum-exp normalizers (LSE) alongside attention output, enabling efficient prefix merging
Supporting Grouped Query Attention (GQA) where multiple query heads share key-value heads
Using hardware-aware autotuning to select optimal block sizes for different GPU architectures (A100, RTX 3090)

The kernel is combined with standard Flash Attention for the prefix (prompt) portion via sigmoid-weighted LSE merging:

$O_{merged} = σ (w) \cdot O_{prefix} + (1 - σ (w)) \cdot O_{tree}$

where the weights are derived from the log-sum-exp normalizers of both attention computations.

Usage

This principle applies whenever tree speculative decoding verification is performed. The kernel is called internally by the GLIDE model's tree decoding attention layer—users do not call it directly but configure it via the tree shape parameter.

The kernel is most impactful for:

Deep trees (4+ levels) with many candidate tokens
Long contexts where the prefix KV cache is large
Models with GQA (e.g., Llama with num_kv_heads < num_heads)

Theoretical Basis

Standard Scaled Dot-Product Attention:

$Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V$

Tree-Masked Attention:

The tree mask M is a binary matrix where M[i,j] = 1 iff token j is an ancestor of token i in the candidate tree. This replaces the standard causal mask:

$TreeAttn (Q, K, V, M) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}} + (1 - M) \cdot (- \infty)) V$

Prefix Merging:

The final attention output combines prefix attention (Flash Attention) with tree attention (Triton kernel) using the log-sum-exp trick:

# Abstract merging logic (not actual implementation)
o_prefix, lse_prefix = flash_attn(q, k_prefix, v_prefix)  # Standard prefix
o_tree, lse_tree = tree_attn(q, k_tree, v_tree, tree_mask)  # Tree candidates

# Sigmoid-weighted combination based on relative log-normalizers
w = sigmoid(lse_prefix - lse_tree)
o_merged = w * o_prefix + (1 - w) * o_tree

Related Pages

Implemented By

Implementation:Sail_sg_LongSpec_Triton_Tree_Attn_Kernel

Uses Heuristic

Heuristic:Sail_sg_LongSpec_Triton_Block_Size_Tuning

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment