Principle:Vllm project Vllm Speculative Generation

From Leeroopedia


Knowledge Sources
Domains: LLM Inference, Speculative Decoding, Text Generation
Last Updated: 2026-02-08 13:00 GMT

Overview

Speculative generation produces text from a language model using draft-then-verify parallelism. The generation API is unaffected by the speculation mechanism, and the output distribution is mathematically equivalent to that of standard autoregressive decoding.

Description

A central property of speculative decoding in vLLM is that the user-facing generation API remains unchanged when speculation is enabled. The same generate() call, with the same prompts and sampling parameters, produces the same output distribution. The only difference is that the engine internally runs a draft-then-verify loop to produce tokens faster.

This transparency is a deliberate design principle in vLLM. The LLM.generate() method accepts prompts and sampling parameters and returns a list of RequestOutput objects. Whether speculation is enabled or disabled, the method signature, input format, and output format are identical. The speculation strategy is entirely encapsulated within the engine, which was configured at initialization time.

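The mathematical equivalence guarantee has two cases. For greedy decoding (temperature=0), the speculative output matches the non-speculative output token for token, up to floating-point effects: batched verification can order arithmetic differently from single-token decoding, which may occasionally flip an argmax between near-tied logits. For sampling (temperature>0), the output distribution is identical because the rejection sampling algorithm used during verification accepts or corrects each draft token so that the emitted token follows the target distribution exactly.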
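The sampling case can be checked by exact arithmetic for a single position. The sketch below is illustrative only: the function name and the example distributions are made up for this page, and it assumes the accept/reject rule given under Theoretical Basis (accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q).

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted at one position.

    p is the target distribution, q is the draft distribution.
    """
    # Probability that token t is drafted AND accepted:
    # q(t) * min(1, p(t) / q(t)) == min(p(t), q(t))
    accepted = [min(pt, qt) for pt, qt in zip(p, q)]
    # Total probability that the drafted token is rejected.
    reject_mass = 1.0 - sum(accepted)
    # Unnormalized corrective distribution max(0, p - q) and its normalizer Z.
    residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
    z = sum(residual)
    if z == 0.0:
        # p == q everywhere: nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]
draft = [0.2, 0.5, 0.3]
result = speculative_output_distribution(target, draft)
```

For these inputs, `result` agrees with `target` to within floating-point rounding, which is the equivalence claim restricted to one position.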

Usage

Use this principle when performing inference with a speculative decoding engine. The key insight for application developers is that no code changes are needed in the generation path itself. All speculative behavior is configured at engine initialization time.
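As a rough illustration of this separation, the sketch below keeps the generation path in one helper and confines speculation to the engine constructor arguments. The helper names are hypothetical. The model name is a placeholder, and the `speculative_config` keys shown (n-gram method) follow recent vLLM releases; option names have changed between versions, so treat them as assumptions to check against the installed version.

```python
from typing import Any, Dict, List


def engine_kwargs(model: str, speculative: bool) -> Dict[str, Any]:
    """Build constructor arguments; only this part knows about speculation."""
    kwargs: Dict[str, Any] = {"model": model}
    if speculative:
        # Assumed option names for n-gram speculation in recent vLLM versions.
        kwargs["speculative_config"] = {
            "method": "ngram",
            "num_speculative_tokens": 5,
            "prompt_lookup_max": 4,
        }
    return kwargs


def generate_texts(llm: Any, prompts: List[str], sampling_params: Any) -> List[str]:
    """Generation path: identical whether or not the engine speculates."""
    outputs = llm.generate(prompts, sampling_params)
    return [request_output.outputs[0].text for request_output in outputs]


def main(speculative: bool) -> None:
    # Imported here so the helpers above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs("facebook/opt-6.7b", speculative))  # placeholder model
    params = SamplingParams(temperature=0.0, max_tokens=32)
    print(generate_texts(llm, ["The future of AI is"], params))
```

Calling `main(True)` or `main(False)` on a machine with vLLM and a suitable GPU exercises the same `generate_texts` code in both cases.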

Theoretical Basis

The Draft-Verify Loop

For each generation step, the speculative engine executes the following loop:

  1. Draft phase: The draft mechanism (EAGLE head, n-gram matcher, MTP head, or draft model) produces K candidate tokens t_1, t_2, ..., t_K
  2. Verification phase: The target model processes all K candidates in a single forward pass, computing the probability distribution at each position
  3. Acceptance/rejection: Each candidate is accepted or rejected sequentially:
for i in 1..K:
    # p_target and q_draft are the distributions at position i,
    # each conditioned on the prompt and t_1..t_{i-1}
    if uniform_random() < min(1, p_target(t_i) / q_draft(t_i)):
        accept t_i  # keep this token
    else:
        reject t_i and all subsequent candidates
        sample t_i' from the adjusted distribution:
            p_adjusted(t) = max(0, p_target(t) - q_draft(t)) / Z
            where Z = sum over t of max(0, p_target(t) - q_draft(t))
        emit t_i' as the corrective token
        break

if all K tokens accepted:
    sample bonus token t_{K+1} from p_target(t_{K+1} | t_1..t_K)
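The loop above can be written as a small runnable function. This is a minimal sketch using plain Python lists as probability vectors, not vLLM's implementation; `verify_draft` and `sample_from` are hypothetical names, and the target distributions are assumed to be already available from the single verification pass.

```python
import random
from typing import List, Sequence


def sample_from(weights: Sequence[float], rng: random.Random) -> int:
    """Draw one token id; weights need not be normalized (this absorbs Z)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify_draft(
    draft_tokens: List[int],
    q_dists: List[List[float]],  # K draft distributions
    p_dists: List[List[float]],  # K + 1 target distributions (last one is for the bonus token)
    rng: random.Random,
) -> List[int]:
    """Return the tokens emitted by one draft-verify step (between 1 and K + 1)."""
    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # q[tok] > 0 because tok was drawn from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)  # accept and move to the next candidate
            continue
        # Reject: resample from the positive part of (p - q), then stop.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
        emitted.append(sample_from(residual, rng))
        return emitted
    # Every candidate was accepted: sample the bonus token from the target.
    emitted.append(sample_from(p_dists[len(draft_tokens)], rng))
    return emitted
```

A rejection can only happen when p[tok] < q[tok], which guarantees that the residual has positive mass somewhere, so the corrective sample is always well defined.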

Why Verification is Cheap

The key to speculative decoding's efficiency is that the target model verifies all K candidate tokens in a single forward pass rather than K sequential passes. This is because:

  • The target model processes the candidate tokens t_1, ..., t_K together in one pass, conditioned on the prompt
  • Each position's logits are computed in parallel, with causal attention masking ensuring that position i sees only the prompt and t_1..t_{i-1}
  • The KV cache from the prompt is reused, and new KV entries are retained only for accepted tokens

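The cost ratio determines whether speculation is beneficial. The condition below treats cost_verify as the cost of one standard decoding step as well, which is a reasonable approximation when decoding is memory-bound. Each speculative step yields the accepted tokens plus one more (the corrective or bonus token):

Speedup condition:
  (1 + E[accepted_tokens]) / (K * cost_draft + cost_verify) > 1 / cost_verify

Simplifying (multiply both sides by cost_verify * (K * cost_draft + cost_verify), then subtract cost_verify):
  E[accepted_tokens] > K * cost_draft / cost_verify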

In practice:
  - EAGLE: cost_draft << cost_verify (lightweight head), so even moderate acceptance rates yield speedup
  - N-gram: cost_draft ~= 0, so any acceptance rate > 0 yields speedup
  - Draft model: cost_draft is non-trivial, requiring higher acceptance rates
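To get a feel for the numbers, the condition can be evaluated directly. The sketch below adds one assumption that is not part of this page: each draft token is accepted independently with the same probability alpha, so that E[accepted_tokens] is a geometric sum. Real acceptance rates vary by position and prompt, so the results are only indicative. Function names are hypothetical.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """E[accepted_tokens] = alpha + alpha^2 + ... + alpha^k under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k)
    return alpha * (1.0 - alpha ** k) / (1.0 - alpha)


def speedup(alpha: float, k: int, cost_draft: float, cost_verify: float) -> float:
    """Throughput with speculation divided by baseline throughput.

    Uses the same cost model as the speedup condition: one baseline decoding
    step costs cost_verify and yields one token.
    """
    tokens_per_step = 1.0 + expected_accepted(alpha, k)
    step_cost = k * cost_draft + cost_verify
    return (tokens_per_step / step_cost) / (1.0 / cost_verify)


# Free drafts (n-gram-like), a cheap head, and a more expensive draft model.
for label, cost_draft in [("free draft", 0.0), ("cheap head", 0.05), ("draft model", 0.3)]:
    print(label, round(speedup(alpha=0.6, k=4, cost_draft=cost_draft, cost_verify=1.0), 3))
```

A value above 1 means speculation helps under this model; the break-even point is exactly E[accepted_tokens] = K * cost_draft / cost_verify.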

Greedy Decoding Equivalence

For greedy decoding (temperature=0), the verification simplifies to:

for i in 1..K:
    if argmax(p_target(position_i)) == t_i:
        accept t_i
    else:
        emit argmax(p_target(position_i)) as corrective token
        break

if all K tokens accepted:
    emit argmax(p_target(position_{K+1})) as bonus token

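Because argmax is deterministic, every emitted token is exactly the token the target model would have chosen at that position. The output therefore matches standard autoregressive decoding token for token, assuming batched verification yields the same logits as single-token decoding up to floating-point precision.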
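The equivalence can be demonstrated on a toy example. In the sketch below, `target_next` and `draft_next` are made-up deterministic functions standing in for the argmax of the target model and an imperfect draft mechanism; nothing here calls vLLM. The list comprehension that builds `targets` plays the role of the single batched verification pass.

```python
from typing import List


def target_next(context: List[int]) -> int:
    """Toy stand-in for argmax over the target model's logits."""
    return (sum(context) * 31 + len(context)) % 7


def draft_next(context: List[int]) -> int:
    """Toy draft: agrees with the target except at every third position."""
    tok = target_next(context)
    return (tok + 1) % 7 if len(context) % 3 == 0 else tok


def autoregressive(prompt: List[int], n: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_greedy(prompt: List[int], n: int, k: int) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        # Draft phase: propose k tokens autoregressively with the draft.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verification: target argmax at each of the k + 1 positions.
        targets = [target_next(seq + draft[:i]) for i in range(k + 1)]
        accepted = 0
        while accepted < k and draft[accepted] == targets[accepted]:
            accepted += 1
        seq.extend(draft[:accepted])
        # Corrective token on a mismatch, bonus token if all k were accepted.
        seq.append(targets[accepted])
    return seq[len(prompt):len(prompt) + n]


baseline = autoregressive([1, 2, 3], 20)
speculative = speculative_greedy([1, 2, 3], 20, k=4)
```

Here `speculative` equals `baseline` for any choice of k, because every appended token is either a draft token that matched the target's argmax or the target's argmax itself.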
