Heuristic:Sail sg LongSpec Tree Shape Configuration

Knowledge Sources	LongSpec LongSpec
Domains	Speculative_Decoding, LLM_Inference, Optimization
Last Updated	2026-02-14 06:00 GMT

Overview

Default tree shape `[4, 16, 16, 16, 16]` for speculative decoding tree construction, producing 69 positions per verification step (68 candidate tokens plus the root), with a narrow first level and wide subsequent levels.

Description

The `tree_shape` parameter controls the branching factor at each level of the speculation tree used in LongSpec's tree-structured speculative decoding. The default `[4, 16, 16, 16, 16]` means: generate 4 candidates at depth 1, then 16 top-k candidates at each of depths 2-5. This creates a total of 4 + 16 + 16 + 16 + 16 = 68 candidate nodes plus the root, totaling 69 positions that the target model verifies in a single forward pass.

The narrow first level (4) limits the initial branching to keep the tree manageable, while the wider subsequent levels (16) explore more token paths to increase the acceptance rate.
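The arithmetic above can be checked with a short sketch. The helper name `count_tree_positions` is hypothetical and not part of LongSpec; it only assumes, as described above, that each entry of `tree_shape` is the number of candidate nodes kept at that depth.

```python
from typing import Sequence


def count_tree_positions(tree_shape: Sequence[int]) -> int:
    """Return the number of positions verified per step: all candidates plus the root.

    Hypothetical helper for illustration; not a LongSpec function.
    """
    if any(n <= 0 for n in tree_shape):
        raise ValueError("every level must keep at least one candidate")
    return sum(tree_shape) + 1  # +1 for the root position


default_shape = [4, 16, 16, 16, 16]
print(count_tree_positions(default_shape))  # 69
```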

Usage

Apply this heuristic when configuring speculative decoding inference with LongSpec. The default `[4, 16, 16, 16, 16]` is used across all benchmark evaluations (LongBench and QwQ/AIME). Adjust the tree shape if you experience VRAM constraints (reduce values) or want to experiment with acceptance rates (increase values).
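As a rough sketch of how an override might look, the snippet below rebuilds only the `--tree_shape` argument with `argparse` and parses a smaller shape. The shape `[4, 8, 8, 8]` is an arbitrary illustrative value, not a configuration tested or recommended by the source, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser exposing only --tree_shape (hypothetical stand-in for the real CLI)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--tree_shape",
        nargs="+",
        type=int,
        default=[4, 16, 16, 16, 16],
        help="Number of candidates kept at each tree depth",
    )
    return parser


# Simulate: python script.py --tree_shape 4 8 8 8
args = build_parser().parse_args(["--tree_shape", "4", "8", "8", "8"])
positions = sum(args.tree_shape) + 1  # candidates plus the root
print(args.tree_shape, positions)  # [4, 8, 8, 8] 29
```

Smaller per-level values reduce the number of positions the target model verifies per step, at the likely cost of fewer accepted tokens.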

The Insight (Rule of Thumb)

Action: Set `--tree_shape 4 16 16 16 16` for tree speculative decoding inference.
Value: 69 total positions per verification step (4 + 16 + 16 + 16 + 16 candidates + 1 root).
Trade-off: Larger tree shapes increase the probability of finding a longer accepted path but consume more VRAM for KV cache and tree mask storage. The tree mask is (batch, M, N) = (1, 69, 69) in float16. The target model must process all 69 tokens in a single forward pass.
Buffer: The code allocates an extra buffer of 256 tokens beyond `max_gen_len` for tree generation to accommodate tree overhead, versus 128 tokens for sequential generation (`set_max_gen_len(max_gen_len + 256)` vs. `+ 128`).
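To make the mask cost concrete, the following sketch estimates the size of a dense `(batch, M, N)` tree mask, assuming 2 bytes per element for float16 and M = N = number of tree positions. `tree_mask_bytes` is a hypothetical helper, not a LongSpec function, and it ignores the KV cache, which is not modeled here.

```python
from typing import Sequence


def tree_mask_bytes(tree_shape: Sequence[int], batch: int = 1,
                    bytes_per_element: int = 2) -> int:
    """Estimate the bytes used by a dense square tree mask.

    Assumes M = N = sum(tree_shape) + 1 and float16 storage (2 bytes per element).
    """
    positions = sum(tree_shape) + 1  # candidates plus the root
    return batch * positions * positions * bytes_per_element


print(tree_mask_bytes([4, 16, 16, 16, 16]))  # 9522 bytes for a (1, 69, 69) mask
```

Because the mask is square in the number of positions, its size grows quadratically as the per-level values increase.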

Reasoning

Tree-structured speculation is more efficient than sequential speculation because the target model can verify many candidate paths in a single forward pass using the custom Triton tree attention kernel, instead of one draft sequence per pass. The `[4, 16, 16, 16, 16]` shape balances:

Memory: 69 tokens fit comfortably in the attention mechanism alongside the full prefix KV cache.
Throughput: The tree mask enables batch verification of all paths simultaneously.
Acceptance rate: Wider levels at depth 2+ give more candidates to match the target model's distribution.
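The idea of verifying all paths at once through a tree mask can be illustrated with a toy example. This is a generic sketch of tree attention masking, not LongSpec's Triton kernel: it assumes the tree is given as a hypothetical `parents` list (parent index per node, `None` for the root) and uses booleans rather than a float16 mask.

```python
from typing import List, Optional


def build_tree_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when node j is node i itself or one of its ancestors.

    Each node therefore attends only to its own root-to-node path,
    so sibling branches do not see each other.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j: Optional[int] = i
        while j is not None:  # walk up to the root
            mask[i][j] = True
            j = parents[j]
    return mask


# Toy tree: node 0 is the root, nodes 1 and 2 are its children, node 3 is a child of node 1.
parents = [None, 0, 0, 1]
mask = build_tree_mask(parents)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# 1...
# 11..
# 1.1.
# 11.1
```

Every row corresponds to one candidate path, which is why a single forward pass over all tree positions can score every path at once.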

The warm-up run before benchmarking (visible in both `inference_long-bench.py` and `inference_qwq.py`) ensures CUDA kernels are compiled and cached before timing begins.
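The warm-up pattern can be sketched generically as follows. `timed_runs` is a hypothetical helper, not taken from the LongSpec scripts; it times an arbitrary callable on the CPU clock. For GPU work, one would typically also call `torch.cuda.synchronize()` before reading the clock, which is omitted here to keep the example dependency-free.

```python
import time
from typing import Callable, List


def timed_runs(fn: Callable[[], object], n_runs: int) -> List[float]:
    """Run fn once untimed (warm-up), then return wall-clock times of n_runs timed calls."""
    fn()  # warm-up: one-time costs (compilation, caching) are paid here and discarded
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Usage with a stand-in workload.
calls = []
timings = timed_runs(lambda: calls.append(sum(range(10_000))), n_runs=3)
print(len(timings), len(calls))  # 3 4  (three timed calls plus one warm-up)
```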

Code Evidence

Default tree shape from `llama_glide.py:931` (train) and `inference_qwq.py:25` (test):

# In model code:
if tree_shape is None:
    cand_num_per_step = [4, 16, 16, 16, 16]

# In CLI:
parser.add_argument('--tree_shape', nargs='+', type=int,
    default=[4, 16, 16, 16, 16],
    help='A list of tree size (default: [4, 16, 16, 16, 16])')

Buffer allocation from `llama_glide.py:926-927`:

# Tree generation needs extra buffer for candidate tokens
self.set_max_gen_len(max_gen_len + 256)

Warm-up run before benchmarking from `inference_long-bench.py:233-240`:

with torch.inference_mode():
    # warm up
    output_ids, count, num, elapsed_time, spec_mask = llama_glide.tree_spec_generate(
        meta_prompts[0]["input_ids"],
        prompt_length=meta_prompts[0]["length"],
        max_gen_len=args.max_gen_len,
        tree_shape=args.tree_shape,
        temperature=args.temperature,
    )
    # real run
    for i in range(args.test_length):
        ...
