Principle:Vllm project Vllm Speculative Generation
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Speculative Decoding, Text Generation |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Speculative generation is the process of producing text from a language model using draft-then-verify parallelism. The speculation mechanism is transparent to the generation API, and the output follows the same distribution as standard autoregressive decoding.
Description
A central property of speculative decoding is that the user-facing generation API remains unchanged when speculation is enabled. The same generate() call, with the same prompts and sampling parameters, produces the same output distribution. The only difference is that the engine internally runs a draft-then-verify loop to produce tokens faster.
This transparency is a deliberate design principle in vLLM. The LLM.generate() method accepts prompts and sampling parameters and returns a list of RequestOutput objects. Whether speculation is enabled or disabled, the method signature, input format, and output format are identical. The speculation strategy is entirely encapsulated within the engine, which was configured at initialization time.
The mathematical equivalence guarantee means that for greedy decoding (temperature=0), the speculative output is bit-identical to non-speculative output, provided the target model's logits are numerically the same in both modes. For sampling (temperature>0), the output distribution is identical because of the rejection sampling algorithm used during verification.
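The sampling claim can be checked exactly for a single position. The following is a minimal sketch, not vLLM code: `output_distribution` is a hypothetical helper, and the two distributions are made-up values over a three-token vocabulary. It computes the probability of each emitted token under accept-or-resample and confirms that it equals the target distribution.

```python
from fractions import Fraction


def output_distribution(p_target, q_draft):
    """Exact distribution of the token emitted at one position by draft-then-verify."""
    # The draft proposes t with probability q(t) and it is accepted with
    # probability min(1, p(t)/q(t)), so the accepted mass for t is min(p(t), q(t)).
    accepted = [min(p, q) for p, q in zip(p_target, q_draft)]
    reject_mass = 1 - sum(accepted)
    # On rejection, the corrective token is drawn from max(0, p - q) / Z.
    residual = [max(Fraction(0), p - q) for p, q in zip(p_target, q_draft)]
    z = sum(residual)
    if z == 0:  # p == q everywhere, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]


# Illustrative distributions (assumed values, exact rational arithmetic).
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
q = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]
assert output_distribution(p, q) == p
```

The equality holds for any pair of distributions because the rejected mass, 1 - sum(min(p, q)), is exactly the normalizer Z of the residual.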
Usage
Use this principle when performing inference with a speculative decoding engine. The key insight for application developers is that no code changes are needed in the generation path itself. All speculative behavior is configured at engine initialization time.
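As an illustration, the separation between configuration and generation might look like the sketch below. It assumes a recent vLLM release in which the LLM constructor accepts a `speculative_config` dictionary; the keys and values shown are assumptions that may differ between versions, and `build_engine` and `run_generation` are hypothetical helper names, not part of vLLM.

```python
def build_engine(model, speculative):
    """Construct a vLLM engine. Speculation is decided here and nowhere else."""
    # Imported lazily so this module can be loaded without vLLM installed.
    from vllm import LLM

    kwargs = {}
    if speculative:
        # Assumed configuration for n-gram drafting; check the documentation
        # of the installed vLLM version for the supported keys.
        kwargs["speculative_config"] = {
            "method": "ngram",
            "num_speculative_tokens": 5,
            "prompt_lookup_max": 4,
        }
    return LLM(model=model, **kwargs)


def run_generation(llm, prompts, sampling_params):
    """Generation path: the same code with or without speculation."""
    outputs = llm.generate(prompts, sampling_params)
    # Each RequestOutput carries its completions in .outputs; take the first.
    return [out.outputs[0].text for out in outputs]
```

A caller would build the engine once, for example `build_engine("facebook/opt-125m", speculative=True)`, and then pass it to `run_generation` together with a `SamplingParams` instance. Switching speculation off changes only the first call.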
Theoretical Basis
The Draft-Verify Loop
For each generation step, the speculative engine executes the following loop:
- Draft phase: The draft mechanism (EAGLE head, n-gram matcher, MTP head, or draft model) produces K candidate tokens t_1, t_2, ..., t_K
- Verification phase: The target model processes all K candidates in a single forward pass, computing the probability distribution at each position
- Acceptance/rejection: Each candidate is accepted or rejected sequentially:
    for i in 1..K:
        if uniform_random() < min(1, p_target(t_i) / q_draft(t_i)):
            accept t_i  # keep this token
        else:
            sample t_i' from adjusted distribution:
                p_adjusted(t) = max(0, p_target(t) - q_draft(t)) / Z
            reject t_i and all subsequent candidates
            emit t_i' as the corrective token
            break
    if all K tokens accepted:
        sample bonus token t_{K+1} from p_target(t_{K+1} | t_1..t_K)
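The acceptance/rejection loop above can be sketched as runnable Python. This is a toy illustration over explicit probability lists, not vLLM's implementation; `verify_draft` and `sample_from` are hypothetical names, and the distributions are assumed to be given as plain lists indexed by token id.

```python
import random


def sample_from(weights, rng):
    """Draw a token id from non-negative weights (they need not sum to 1)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify_draft(draft_tokens, q_draft, p_target, rng):
    """Return the tokens emitted by one draft-verify step.

    draft_tokens: K candidate token ids.
    q_draft:      K draft distributions; q_draft[i][t] is the draft
                  probability of token t at position i.
    p_target:     K + 1 target distributions; the last one is used for
                  the bonus token when every candidate is accepted.
    """
    emitted = []
    for i, t in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept with probability min(1, p(t) / q(t)).
        # q[t] > 0 is assumed because t was proposed by the draft.
        if rng.random() < min(1.0, p[t] / q[t]):
            emitted.append(t)
            continue
        # Rejected: resample from the residual max(0, p - q). Normalization
        # by Z is implicit because sample_from accepts unnormalized weights.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
        emitted.append(sample_from(residual, rng))
        return emitted  # discard all later candidates
    # Every candidate was accepted, so sample one bonus token.
    emitted.append(sample_from(p_target[len(draft_tokens)], rng))
    return emitted
```

Each call emits between 1 and K + 1 tokens: the accepted prefix followed by either a corrective token or a bonus token.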
Why Verification is Cheap
The key to speculative decoding's efficiency is that the target model can verify all K candidate tokens in a single forward pass, rather than the K sequential forward passes that autoregressive decoding would need to produce them. This is because:
- The target model processes the entire sequence [prompt, t_1, ..., t_K] as a batch
- Each position's logits are computed in parallel via the transformer's attention mechanism
- The KV cache from the prompt is reused, and new KV entries for accepted tokens are retained
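The cost ratio determines whether speculation is beneficial. Assume a verification pass costs about the same as one standard decoding step (cost_verify), and that each step emits the accepted tokens plus one corrective or bonus token. Speculation then wins when its tokens per unit cost exceed the baseline's:
Speedup condition:
    (1 + E[accepted_tokens]) / (K * cost_draft + cost_verify) > 1 / cost_verify
Simplifying:
    E[accepted_tokens] > K * cost_draft / cost_verify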
In practice:
- EAGLE: cost_draft << cost_verify (lightweight head), so even moderate acceptance rates yield speedup
- N-gram: cost_draft ~= 0, so any acceptance rate > 0 yields speedup
- Draft model: cost_draft is non-trivial, requiring higher acceptance rates
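The speedup condition can be turned into a small calculator. This is a sketch of the simplified cost model stated above, with hypothetical function names and made-up cost figures; it assumes a verification pass costs the same as one ordinary decoding step and ignores scheduling and memory overheads.

```python
def speculative_speedup(expected_accepted, k, cost_draft, cost_verify):
    """Throughput of speculation divided by throughput of plain decoding."""
    if k < 1 or cost_draft < 0 or cost_verify <= 0:
        raise ValueError("need k >= 1, cost_draft >= 0 and cost_verify > 0")
    tokens_per_step = 1 + expected_accepted      # accepted + corrective/bonus
    step_cost = k * cost_draft + cost_verify     # K drafts + one verification
    # Baseline emits 1 token per cost_verify, so divide by 1 / cost_verify.
    return tokens_per_step * cost_verify / step_cost


def break_even_acceptance(k, cost_draft, cost_verify):
    """Smallest E[accepted_tokens] at which speculation stops losing."""
    return k * cost_draft / cost_verify


# Illustrative numbers only: K = 4, drafting costs 10% of a target pass.
fast = speculative_speedup(expected_accepted=2.0, k=4, cost_draft=0.1, cost_verify=1.0)
slow = speculative_speedup(expected_accepted=0.2, k=4, cost_draft=0.1, cost_verify=1.0)
threshold = break_even_acceptance(k=4, cost_draft=0.1, cost_verify=1.0)
```

With these figures the break-even point is 0.4 accepted tokens per step, so `fast` is above 1 and `slow` is below 1.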
Greedy Decoding Equivalence
For greedy decoding (temperature=0), the verification simplifies to:
    for i in 1..K:
        if argmax(p_target(position_i)) == t_i:
            accept t_i
        else:
            emit argmax(p_target(position_i)) as corrective token
            break
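This produces bit-identical output because the argmax operation is deterministic: every emitted token, whether accepted or corrective, is the target model's own argmax at that position. As long as the verification pass computes the same logits as step-by-step decoding, the output is exactly what the target model would have produced through standard autoregressive decoding.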
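The greedy equivalence can be demonstrated end to end with toy stand-ins for the two models. Everything in this sketch is hypothetical: `target_next` and `draft_next` are arbitrary deterministic functions over integer token ids, chosen only so that the draft is sometimes wrong. The verification loop also appends a bonus token when all K candidates match, as in the general draft-verify loop.

```python
def target_next(context):
    """Toy stand-in for the target model's greedy (argmax) next token."""
    return (sum(context) * 31 + len(context)) % 7


def draft_next(context):
    """Toy stand-in for a cheaper draft that is wrong at some positions."""
    token = target_next(context)
    return token if len(context) % 3 else (token + 1) % 7


def autoregressive(prompt, n):
    """Standard decoding: one target call per generated token."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_greedy(prompt, n, k):
    """Greedy draft-then-verify decoding with k candidates per step."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        # Draft phase: propose k tokens autoregressively with the draft.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verification: a real engine obtains these k + 1 argmaxes from one
        # forward pass; here they are computed one at a time for clarity.
        for i in range(k + 1):
            expected = target_next(seq)  # seq already ends with draft[:i]
            seq.append(expected)         # accepted, corrective, or bonus token
            if i == k or draft[i] != expected:
                break
    return seq[len(prompt):len(prompt) + n]


assert speculative_greedy([1, 2, 3], 20, 4) == autoregressive([1, 2, 3], 20)
```

The draft only affects how many tokens each step yields. It never changes which tokens are emitted, which is why the two functions agree for any choice of `draft_next`.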