Principle:Romsto Speculative Decoding Speculative Decoding
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference_Optimization, Language_Models |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
An inference acceleration technique that uses a smaller draft model to propose multiple tokens which are then verified in parallel by a larger target model, preserving the exact output distribution of the target.
Description
Speculative Decoding addresses the fundamental bottleneck in autoregressive language model inference: each token generation requires a full forward pass through the model, and these passes are inherently sequential. Because large model forward passes are often memory-bandwidth-bound rather than compute-bound, the GPU is underutilized during standard autoregressive generation.
The technique works by employing a smaller, faster drafter model to speculatively generate a sequence of gamma candidate tokens. These drafts are then verified against the larger target model in a single forward pass (which processes all gamma positions simultaneously). Using a rejection sampling scheme based on the ratio of target-to-drafter probabilities, the algorithm accepts a prefix of the draft tokens and samples a correction token from an adjusted distribution when a rejection occurs.
The key theoretical guarantee is that the output distribution of speculative decoding is identical to that of the target model alone. This is achieved through the rejection sampling mechanism: a draft token sampled from distribution q is accepted with probability min(1, p/q) where p is the target distribution. If rejected, a replacement token is sampled from the normalized positive part of (p - q), ensuring the marginal distribution matches p exactly.
Usage
Use this principle when accelerating inference from large autoregressive language models and a smaller model from the same family is available as a drafter. It is most effective when the drafter model has high agreement with the target model (leading to high acceptance rates) and when the target model is memory-bandwidth-bound (so verifying gamma tokens costs roughly the same as generating one token).
Theoretical Basis
The algorithm proceeds in rounds. In each round:
- The drafter model generates gamma tokens autoregressively: for each position k, sample
- The target model evaluates all gamma positions in a single forward pass, producing distributions for each position
- For each draft position i = 1, ..., gamma:
- Draw
- If , accept the token
- Otherwise, reject: sample a replacement from and stop
- If all gamma tokens are accepted, sample one additional token from the target model's distribution at the last position
Pseudo-code:
# Abstract speculative decoding algorithm
while not done:
# Step 1: Draft gamma tokens with small model
for k in range(gamma):
q_k = drafter(prefix + drafts[:k])
drafts[k] = sample(q_k)
# Step 2: Verify all drafts with target model (single forward pass)
p = target(prefix + drafts[:gamma]) # parallel evaluation
# Step 3: Rejection sampling
n = gamma # assume all accepted
for i in range(gamma):
r = uniform(0, 1)
if r > p[i, drafts[i]] / q[i, drafts[i]]:
n = i # first rejection at position i
break
# Step 4: Accept prefix, sample correction
accept(drafts[:n])
if n < gamma:
x = sample(norm(max(0, p[n] - q[n]))) # adjusted distribution
else:
x = sample(p[gamma]) # bonus token
append(x)
The expected number of tokens generated per round is where is the average acceptance rate. With a high acceptance rate (e.g., 0.8) and gamma=5, this yields significant speedups over standard autoregressive generation.