Principle:Romsto Speculative Decoding Speculative Decoding

Knowledge Sources	Fast Inference from Transformers via Speculative Decoding Accelerating Large Language Model Decoding with Speculative Sampling Speculative Decoding
Domains	NLP, Inference_Optimization, Language_Models
Last Updated	2026-02-14 04:30 GMT

Overview

An inference acceleration technique that uses a smaller draft model to propose multiple tokens which are then verified in parallel by a larger target model, preserving the exact output distribution of the target.

Description

Speculative Decoding addresses the fundamental bottleneck in autoregressive language model inference: each token generation requires a full forward pass through the model, and these passes are inherently sequential. Because large model forward passes are often memory-bandwidth-bound rather than compute-bound, the GPU is underutilized during standard autoregressive generation.

The technique works by employing a smaller, faster drafter model to speculatively generate a sequence of gamma candidate tokens. These drafts are then verified against the larger target model in a single forward pass (which processes all gamma positions simultaneously). Using a rejection sampling scheme based on the ratio of target-to-drafter probabilities, the algorithm accepts a prefix of the draft tokens and samples a correction token from an adjusted distribution when a rejection occurs.

The key theoretical guarantee is that the output distribution of speculative decoding is identical to that of the target model alone. This is achieved through the rejection sampling mechanism: a draft token sampled from distribution q is accepted with probability min(1, p/q) where p is the target distribution. If rejected, a replacement token is sampled from the normalized positive part of (p - q), ensuring the marginal distribution matches p exactly.

Usage

Use this principle when accelerating inference from large autoregressive language models and a smaller model from the same family is available as a drafter. It is most effective when the drafter model has high agreement with the target model (leading to high acceptance rates) and when the target model is memory-bandwidth-bound (so verifying gamma tokens costs roughly the same as generating one token).

Theoretical Basis

The algorithm proceeds in rounds. In each round:

The drafter model generates gamma tokens autoregressively: for each position k, sample $x_{k} \sim q (\cdot | x_{< k})$
The target model evaluates all gamma positions in a single forward pass, producing distributions $p (\cdot | x_{< k})$ for each position
For each draft position i = 1, ..., gamma:
1. Draw $r \sim Uniform (0, 1)$
2. If $r < \frac{p (x_{i} | x_{< i})}{q (x_{i} | x_{< i})}$ , accept the token
3. Otherwise, reject: sample a replacement from $norm (\max (0, p (\cdot | x_{< i}) - q (\cdot | x_{< i})))$ and stop
If all gamma tokens are accepted, sample one additional token from the target model's distribution at the last position

Pseudo-code:

# Abstract speculative decoding algorithm
while not done:
    # Step 1: Draft gamma tokens with small model
    for k in range(gamma):
        q_k = drafter(prefix + drafts[:k])
        drafts[k] = sample(q_k)

    # Step 2: Verify all drafts with target model (single forward pass)
    p = target(prefix + drafts[:gamma])  # parallel evaluation

    # Step 3: Rejection sampling
    n = gamma  # assume all accepted
    for i in range(gamma):
        r = uniform(0, 1)
        if r > p[i, drafts[i]] / q[i, drafts[i]]:
            n = i  # first rejection at position i
            break

    # Step 4: Accept prefix, sample correction
    accept(drafts[:n])
    if n < gamma:
        x = sample(norm(max(0, p[n] - q[n])))  # adjusted distribution
    else:
        x = sample(p[gamma])  # bonus token
    append(x)

The expected number of tokens generated per round is $E [tokens] = \frac{1 - α^{γ + 1}}{1 - α}$ where $α$ is the average acceptance rate. With a high acceptance rate (e.g., 0.8) and gamma=5, this yields significant speedups over standard autoregressive generation.

Related Pages

Implemented By

Implementation:Romsto_Speculative_Decoding_Speculative_Generate

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment