Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Romsto Speculative Decoding Speculative Decoding

From Leeroopedia
Revision as of 18:08, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Romsto_Speculative_Decoding_Speculative_Decoding.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains NLP, Inference_Optimization, Language_Models
Last Updated 2026-02-14 04:30 GMT

Overview

An inference acceleration technique that uses a smaller draft model to propose multiple tokens which are then verified in parallel by a larger target model, preserving the exact output distribution of the target.

Description

Speculative Decoding addresses the fundamental bottleneck in autoregressive language model inference: each token generation requires a full forward pass through the model, and these passes are inherently sequential. Because large model forward passes are often memory-bandwidth-bound rather than compute-bound, the GPU is underutilized during standard autoregressive generation.

The technique works by employing a smaller, faster drafter model to speculatively generate a sequence of gamma candidate tokens. These drafts are then verified against the larger target model in a single forward pass (which processes all gamma positions simultaneously). Using a rejection sampling scheme based on the ratio of target-to-drafter probabilities, the algorithm accepts a prefix of the draft tokens and samples a correction token from an adjusted distribution when a rejection occurs.

The key theoretical guarantee is that the output distribution of speculative decoding is identical to that of the target model alone. This is achieved through the rejection sampling mechanism: a draft token sampled from distribution q is accepted with probability min(1, p/q) where p is the target distribution. If rejected, a replacement token is sampled from the normalized positive part of (p - q), ensuring the marginal distribution matches p exactly.

Usage

Use this principle when accelerating inference from large autoregressive language models and a smaller model from the same family is available as a drafter. It is most effective when the drafter model has high agreement with the target model (leading to high acceptance rates) and when the target model is memory-bandwidth-bound (so verifying gamma tokens costs roughly the same as generating one token).

Theoretical Basis

The algorithm proceeds in rounds. In each round:

  1. The drafter model generates gamma tokens autoregressively: for each position k, sample xkq(|x<k)
  2. The target model evaluates all gamma positions in a single forward pass, producing distributions p(|x<k) for each position
  3. For each draft position i = 1, ..., gamma:
    1. Draw rUniform(0,1)
    2. If r<p(xi|x<i)q(xi|x<i), accept the token
    3. Otherwise, reject: sample a replacement from norm(max(0,p(|x<i)q(|x<i))) and stop
  4. If all gamma tokens are accepted, sample one additional token from the target model's distribution at the last position

Pseudo-code:

# Abstract speculative decoding algorithm
while not done:
    # Step 1: Draft gamma tokens with small model
    for k in range(gamma):
        q_k = drafter(prefix + drafts[:k])
        drafts[k] = sample(q_k)

    # Step 2: Verify all drafts with target model (single forward pass)
    p = target(prefix + drafts[:gamma])  # parallel evaluation

    # Step 3: Rejection sampling
    n = gamma  # assume all accepted
    for i in range(gamma):
        r = uniform(0, 1)
        if r > p[i, drafts[i]] / q[i, drafts[i]]:
            n = i  # first rejection at position i
            break

    # Step 4: Accept prefix, sample correction
    accept(drafts[:n])
    if n < gamma:
        x = sample(norm(max(0, p[n] - q[n])))  # adjusted distribution
    else:
        x = sample(p[gamma])  # bonus token
    append(x)

The expected number of tokens generated per round is E[tokens]=1αγ+11α where α is the average acceptance rate. With a high acceptance rate (e.g., 0.8) and gamma=5, this yields significant speedups over standard autoregressive generation.

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment