Heuristic:Romsto Speculative Decoding Gamma Tuning

Knowledge Sources	Romsto Speculative-Decoding; Fast Inference from Transformers via Speculative Decoding
Domains	LLMs, Optimization, Speculative_Decoding
Last Updated	2026-02-14 04:30 GMT

Overview

Tuning the gamma (draft count) hyperparameter for speculative decoding to balance acceptance rate against per-step overhead.

Description

Gamma (γ) is the number of draft tokens the drafter model generates at each speculative step before the target model verifies them in a single parallel forward pass. Increasing gamma does not always speed up generation, because more of the drafts may be rejected by the target model. The key metric is the acceptance rate (α): the number of drafts accepted by the target model divided by the number of drafts generated. The optimal gamma depends on how closely the drafter's distribution matches the target's: a well-aligned drafter tolerates a higher gamma, while a poorly aligned one wastes compute on rejected drafts.

Usage

Use this heuristic when configuring speculative decoding for a new model pair. If the acceptance rate (reported by `speculative_generate`) is low (e.g., below 0.5), consider reducing gamma. If the acceptance rate is high (e.g., above 0.8), try increasing gamma to generate more tokens per step.

The Insight (Rule of Thumb)

Action: Set gamma via the `/gamma <value>` CLI command or the `gamma` parameter in `speculative_generate()`.
Default Value: `gamma=4` in the CLI (`infer.py:28`), `gamma=5` in the API functions (`speculative_decoding.py:29`).
Tuning Strategy: Start with gamma=4. Monitor the acceptance rate α. If α > 0.8, increase gamma by 1-2. If α < 0.5, decrease gamma.
Trade-off: Higher gamma lets the target verify more tokens per forward pass but wastes more drafter compute on rejected tokens. Lower gamma means fewer rejections but fewer tokens per target forward pass.
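The tuning strategy above can be sketched as a small helper. This is a minimal illustration, not part of the repository: the function name `suggest_gamma`, the step size of 1, and the `min_gamma`/`max_gamma` bounds are assumptions chosen for the example. Only the 0.5 and 0.8 thresholds come from the rule of thumb.

```python
def suggest_gamma(
    gamma: int,
    acceptance_rate: float,
    low: float = 0.5,      # threshold from the rule of thumb
    high: float = 0.8,     # threshold from the rule of thumb
    min_gamma: int = 1,    # assumed lower bound (at least one draft per step)
    max_gamma: int = 16,   # assumed upper bound, purely illustrative
) -> int:
    """Return the next gamma to try, given the last observed acceptance rate."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must lie in [0, 1]")
    if acceptance_rate > high:
        # Drafts are mostly accepted: try drafting one more token per step.
        return min(gamma + 1, max_gamma)
    if acceptance_rate < low:
        # Too many rejections: draft fewer tokens per step.
        return max(gamma - 1, min_gamma)
    # Between the thresholds: keep the current setting.
    return gamma


if __name__ == "__main__":
    gamma = 4  # starting point suggested by the heuristic
    for alpha in (0.85, 0.9, 0.6, 0.4):
        gamma = suggest_gamma(gamma, alpha)
        print(f"alpha={alpha:.2f} -> gamma={gamma}")
```

Stepping by 1 rather than 2 keeps the adjustment conservative. Since α depends on the prompt, it is probably worth averaging it over several generations before changing gamma.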

Reasoning

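The speculative decoding algorithm generates γ draft tokens, then verifies all γ in a single target model forward pass. If n of the γ drafts are accepted (n ≤ γ), you get n+1 tokens for the cost of γ drafter forward passes + 1 target forward pass: the n accepted drafts plus one token supplied by the verification step (a replacement at the first rejected position, or an extra token when every draft is accepted). The speedup comes from the drafter being much cheaper than the target.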
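The cost accounting above can be turned into a rough estimate of the speedup for a given gamma. The sketch below is a simplification: it assumes each draft is accepted independently with the same probability `p`, and it measures cost in units of one target forward pass, with `c` as the cost of one drafter pass relative to one target pass. The function names, `c`, and `max_gamma` are hypothetical and not part of the repository. Also note that `p` is a per-token probability, not the aggregate acceptance rate α that `speculative_generate` reports.

```python
def expected_tokens_per_step(p: float, gamma: int) -> float:
    """Expected value of n + 1 when each draft is accepted with probability p.

    Sums P(at least k drafts accepted) = p**k for k = 0..gamma,
    i.e. the geometric series (1 - p**(gamma + 1)) / (1 - p).
    """
    if p >= 1.0:
        return float(gamma + 1)
    return (1.0 - p ** (gamma + 1)) / (1.0 - p)


def expected_speedup(p: float, gamma: int, c: float) -> float:
    """Tokens per unit cost relative to plain decoding (1 token per target pass).

    One step costs gamma drafter passes (gamma * c) plus one target pass.
    """
    return expected_tokens_per_step(p, gamma) / (gamma * c + 1.0)


def best_gamma(p: float, c: float, max_gamma: int = 16) -> int:
    """Gamma in 1..max_gamma that maximizes the estimated speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: expected_speedup(p, g, c))


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        g = best_gamma(p, c=0.05)
        print(f"p={p}: best gamma={g}, speedup={expected_speedup(p, g, 0.05):.2f}x")
```

Under these assumptions, a higher per-token acceptance probability pushes the best gamma upward, which matches the rule of thumb. Real acceptance is not independent across positions, so treat the output as a starting point for measurement rather than a prediction.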

However, every draft after the first rejection is discarded, so the drafter passes spent on drafts beyond position n are wasted compute. The acceptance rate depends on how well the drafter approximates the target's distribution. For closely related models (e.g., Llama-3.2-1B vs 3B from the same family), α is typically high. For dissimilar models, α drops rapidly with increasing γ, because a larger share of each step's drafts falls after the first rejection.
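To see why the measured α falls as γ grows, the following sketch computes the expected accepted/generated ratio for a single step. It assumes every draft is accepted independently with a fixed per-token probability `p`, which is a simplification. The function name `expected_acceptance_rate` is hypothetical and not part of the repository.

```python
def expected_acceptance_rate(p: float, gamma: int) -> float:
    """Expected (accepted drafts) / (generated drafts) for one speculative step.

    Draft k counts as accepted only if drafts 1..k are all accepted, which
    happens with probability p**k, so the expected number of accepted drafts
    is p + p**2 + ... + p**gamma.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if p >= 1.0:
        return 1.0
    expected_accepted = p * (1.0 - p ** gamma) / (1.0 - p)
    return expected_accepted / gamma


if __name__ == "__main__":
    for p in (0.9, 0.5):
        rates = [round(expected_acceptance_rate(p, g), 3) for g in (1, 2, 4, 8)]
        print(f"p={p}: alpha for gamma=1,2,4,8 -> {rates}")
```

With gamma=1 the ratio equals `p`, and it shrinks for larger gamma, faster when `p` is small. This is one reason to compare acceptance rates only between runs that used the same gamma.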

From `README.md`:

"Increasing the value of γ will not always lead to a faster generation, as the drafts may be rejected more. The acceptance rate α is the number of drafts accepted by the target model divided by the number of drafts generated. The higher the acceptance rate, the faster the generation."

Code evidence from `sampling/speculative_decoding.py:106`:

corrected_gamma = min(gamma, total_len - current_position - 1)
q = torch.zeros((1, corrected_gamma, vocabulary_size), device=target.device)
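The first line of this snippet caps gamma near the end of the generation budget. The sketch below isolates that expression in plain Python. The helper name `clamp_gamma` is hypothetical. It is also an assumption that `total_len` is the maximum sequence length and `current_position` the index of the next token to write, and that the `- 1` leaves room for the extra token each step emits.

```python
def clamp_gamma(gamma: int, total_len: int, current_position: int) -> int:
    """Mirror of the corrected_gamma expression: never draft past the budget."""
    return min(gamma, total_len - current_position - 1)


if __name__ == "__main__":
    # Far from the end of the budget, the requested gamma is used unchanged.
    print(clamp_gamma(4, total_len=100, current_position=10))  # 4
    # Close to the end, fewer drafts are generated.
    print(clamp_gamma(4, total_len=100, current_position=97))  # 2
    print(clamp_gamma(4, total_len=100, current_position=98))  # 1
```

One consequence for tuning: the last steps of a generation may run with a smaller gamma than the one configured, so acceptance statistics from very short generations may not reflect the configured gamma.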

Code evidence from `infer.py:28` and `infer.py:152-157`:

self.gamma = 4  # Default gamma in CLI

# Runtime gamma adjustment
if args[0] == "/gamma":
    self.gamma = int(args[1])
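The runtime adjustment shown above converts the argument with `int()` directly, so a missing or non-numeric value would raise an exception. A more defensive variant could look like the sketch below. The function `parse_gamma_command` is hypothetical and not part of `infer.py`. It assumes `args` is the whitespace-split command line, as in the snippet.

```python
def parse_gamma_command(args: list[str], current_gamma: int) -> int:
    """Return the new gamma for a '/gamma <value>' command.

    Falls back to current_gamma when the command is malformed.
    """
    if len(args) < 2 or args[0] != "/gamma":
        return current_gamma
    try:
        value = int(args[1])
    except ValueError:
        # Non-numeric argument such as '/gamma abc'.
        return current_gamma
    if value < 1:
        # At least one draft token per step is needed for speculation.
        return current_gamma
    return value


if __name__ == "__main__":
    gamma = 4
    for line in ("/gamma 6", "/gamma abc", "/gamma 0", "/gamma"):
        gamma = parse_gamma_command(line.split(), gamma)
        print(f"{line!r} -> gamma={gamma}")
```

Keeping the previous value on bad input means a typo during an interactive tuning session does not crash the session or change gamma unexpectedly.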
