Principle:Sail sg LongSpec Token Acceptance

Knowledge Sources	Fast Inference from Transformers via Speculative Decoding LongSpec LongSpec
Domains	Speculative_Decoding, Sampling, LLM_Inference
Last Updated	2026-02-14 05:00 GMT

Overview

Decision procedure that determines which draft model tokens to accept based on agreement with the target LLM, supporting both greedy (deterministic) and stochastic (rejection sampling) verification modes.

Description

Token Acceptance is the final step in speculative decoding, deciding which of the draft model's proposed tokens are valid. After the target LLM has scored all candidate tokens (from either a tree or sequential chain), the acceptance algorithm traverses the candidates and determines the longest acceptable prefix.

Two modes are supported:

Greedy verification (temperature = 0): A token is accepted if and only if the target LLM's argmax prediction matches the draft token. This is deterministic and guarantees the output is identical to standard greedy decoding.
Stochastic verification (temperature > 0): Uses rejection sampling, where a draft token x is accepted with probability min(1, P_target(x) / P_draft(x)). If the token is rejected, a replacement is sampled from a corrected (residual) distribution and the remaining draft tokens on that path are discarded. This guarantees that the output distribution matches the target LLM's distribution exactly.

For tree-structured candidates, the algorithm traverses the tree depth-first, finding the deepest accepted node. It also handles KV cache rearrangement: after acceptance, the KV cache must be reorganized so that it holds only the entries for the accepted token sequence.
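The cache rearrangement can be pictured as a gather along the sequence axis. The following is a minimal sketch, not LongSpec's implementation: the cache layout `(num_heads, seq_len, head_dim)` and the names `compact_kv_cache`, `prefix_len`, and `accepted_positions` are assumptions made for illustration.

```python
import numpy as np

def compact_kv_cache(cache, prefix_len, accepted_positions):
    """Keep the committed prefix plus the accepted candidate positions, in order.

    cache: array of shape (num_heads, seq_len, head_dim) (assumed layout).
    prefix_len: number of positions already committed before this step.
    accepted_positions: indices, relative to the candidate block, of the
        tokens on the accepted path, listed from shallowest to deepest.
    """
    keep = np.concatenate([
        np.arange(prefix_len),
        prefix_len + np.asarray(accepted_positions, dtype=np.int64),
    ])
    # Fancy indexing along the sequence axis drops every rejected candidate.
    return cache[:, keep, :]

# 4 committed positions followed by 5 candidate positions;
# the accepted path uses candidates 0, 2 and 4.
cache = np.arange(2 * 9 * 3, dtype=np.float32).reshape(2, 9, 3)
compacted = compact_kv_cache(cache, prefix_len=4, accepted_positions=[0, 2, 4])
```

In this sketch the rejected candidates simply disappear from the sequence axis, so the next decoding step sees a cache that looks as if only the accepted tokens had ever been processed.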

Usage

This principle is applied automatically during every speculative decoding step. The choice between greedy and stochastic mode is controlled by the temperature parameter:

temperature = 0.0: Greedy verification (deterministic, exact match)
temperature > 0.0: Stochastic verification (sampling, distribution-preserving)
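As a rough illustration of how the temperature switch might be handled, the sketch below maps logits to the distribution used for verification. The function names `verification_mode` and `target_distribution` are hypothetical, and treating temperature 0 as a one-hot argmax distribution is an assumption of this sketch rather than something the source specifies.

```python
import numpy as np

def verification_mode(temperature):
    """Select the verification mode from the temperature parameter."""
    return "greedy" if temperature == 0.0 else "stochastic"

def target_distribution(logits, temperature):
    """Turn logits into the distribution used during verification."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # Greedy: all probability mass on the argmax, avoiding division by zero.
        one_hot = np.zeros_like(logits)
        one_hot[np.argmax(logits)] = 1.0
        return one_hot
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

greedy_probs = target_distribution([1.0, 3.0, 2.0], temperature=0.0)
sampled_probs = target_distribution([1.0, 3.0, 2.0], temperature=1.0)
```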

Theoretical Basis

Greedy Acceptance:

For tree-structured candidates with tree mask M and father indices f:

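# Abstract greedy verification (not the actual implementation).
# Assumes node 0 is the root and tree_order lists parents before children.
import numpy as np

accepted = {0: True}  # the root (last committed token) is always accepted
for node in tree_order:
    if node == 0:
        continue
    parent = father_index[node]
    target_pred = int(np.argmax(target_logits[parent]))  # target's choice after parent
    draft_token = draft_tokens[node]
    accepted[node] = accepted[parent] and (target_pred == draft_token)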

The length of the longest accepted path from the root gives the number of "free" tokens, i.e. draft tokens committed without an additional target forward pass.
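Extracting that path from the per-node acceptance flags can be sketched as follows. The helper name `longest_accepted_path` is hypothetical, and the sketch assumes node 0 is the root with `father_index[0] == -1`, that parents precede their children, and that `accepted[node]` already accounts for the node's ancestors, as in the rule above.

```python
def longest_accepted_path(accepted, father_index):
    """Return node indices from just below the root to the deepest accepted node."""
    depth = [0] * len(accepted)
    best = 0  # the root is the fallback when no draft token is accepted
    for node in range(1, len(accepted)):
        if accepted[node]:
            depth[node] = depth[father_index[node]] + 1
            if depth[node] > depth[best]:
                best = node
    # Walk back from the deepest accepted node to the root.
    path = []
    node = best
    while node != 0:
        path.append(node)
        node = father_index[node]
    return path[::-1]

# Tree: 0 -> {1, 2}, 1 -> {3}, 3 -> {4}, 2 -> {5}
father_index = [-1, 0, 0, 1, 3, 2]
accepted = [True, True, False, True, False, False]
path = longest_accepted_path(accepted, father_index)
num_free_tokens = len(path)
```

Here nodes 1 and 3 form the accepted path, so two draft tokens are committed in this step.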

Stochastic Acceptance (Rejection Sampling):

For each candidate token x:

$P(\text{accept } x) = \min\left(1, \frac{P_{\text{target}}(x)}{P_{\text{draft}}(x)}\right)$

If rejected, sample from the residual distribution:

$P_{\text{resample}}(x) \propto \max\left(0, P_{\text{target}}(x) - P_{\text{draft}}(x)\right)$

This guarantees that the final output distribution equals the target LLM's distribution regardless of draft model quality.
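The two formulas above can be combined into a short sketch for a sequential draft chain. This is an illustration under stated assumptions, not the LongSpec implementation: the function name `verify_chain` and the array layout `(K, vocab)` are hypothetical, and the sketch omits the extra token that is usually sampled from the target when every draft token is accepted.

```python
import numpy as np

def verify_chain(draft_tokens, p_draft, p_target, rng):
    """Rejection-sampling verification over a sequential draft chain.

    draft_tokens: K token ids proposed by the draft model.
    p_draft, p_target: arrays of shape (K, vocab) holding each model's
        distribution at every draft position.
    Returns the accepted tokens, followed by one resampled token if a
    rejection occurred.
    """
    output = []
    for i, x in enumerate(draft_tokens):
        accept_prob = min(1.0, p_target[i, x] / p_draft[i, x])
        if rng.random() < accept_prob:
            output.append(int(x))
            continue
        # Rejected: resample from the normalized residual and stop.
        residual = np.maximum(0.0, p_target[i] - p_draft[i])
        residual /= residual.sum()
        output.append(int(rng.choice(len(residual), p=residual)))
        break
    return output

rng = np.random.default_rng(0)

# Identical distributions: the acceptance probability is 1 at every position.
p_same = np.array([[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]])
all_accepted = verify_chain([1, 2], p_same, p_same, rng)

# Target gives the draft token zero probability: it is always rejected
# and the replacement comes from the residual distribution.
p_d = np.array([[1.0, 0.0, 0.0]])
p_t = np.array([[0.0, 1.0, 0.0]])
resampled = verify_chain([0], p_d, p_t, rng)
```

A rejection implies P_target(x) < P_draft(x) for the drafted token, so the residual always has positive mass somewhere and the normalization is well defined.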

Related Pages

Implemented By

Implementation:Sail_sg_LongSpec_Tree_Verification_Accept
