Principle:Vllm project Vllm Speculative Method Selection
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Speculative Decoding, Performance Optimization |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Selecting a speculative decoding strategy determines how draft tokens are proposed and verified during autoregressive generation. The choice trades off model complexity, memory usage, and acceptance rate.
Description
Speculative decoding accelerates LLM inference by generating multiple candidate tokens cheaply (the "draft" phase) and then verifying them in a single forward pass of the target model (the "verify" phase). The choice of speculation method dictates how the draft tokens are produced:
- EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency): Uses a lightweight head network trained on the target model's hidden states to predict future tokens. EAGLE leverages feature-level extrapolation rather than token-level prediction, achieving high acceptance rates with minimal overhead. The EAGLE head is a small neural network (typically a single transformer layer) that takes the target model's hidden states and predicts the next token distribution.
- EAGLE3: An improved variant of EAGLE whose draft head consumes intermediate hidden states from the target model in addition to the final hidden state, enabling more accurate multi-step speculation. EAGLE3 typically achieves higher acceptance rates than EAGLE on the same models.
- N-gram (Prompt Lookup): A model-free approach that proposes tokens by searching for n-gram matches in the existing prompt/context. When the model is generating text similar to input content (e.g., summarization, code completion with context), n-gram matching can achieve good acceptance rates with no additional model weights and negligible extra compute.
- MTP (Multi-Token Prediction): Uses the target model's own multi-token prediction heads, which must be natively supported by the model architecture. Models like DeepSeek-V3 include MTP heads that were trained alongside the main model, providing high-quality draft predictions at no additional download cost.
- Draft Model: Uses a separate, smaller language model as the draft proposer, usually one from the same model family as the target so that both share a tokenizer and vocabulary. The draft model generates tokens autoregressively, and the target model verifies them in parallel. This is the classical speculative decoding approach.
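As an illustration of the model-free N-gram method, the lookup can be sketched in a few lines. This is a minimal sketch, not vLLM's implementation: the function name `propose_ngram` and its parameters (`max_n`, `min_n`, `num_draft`) are hypothetical, and preferring the longest and most recent match is an assumption made for the example.

```python
from typing import List, Sequence


def propose_ngram(
    context: Sequence[int],
    max_n: int = 3,
    min_n: int = 1,
    num_draft: int = 4,
) -> List[int]:
    """Propose draft tokens by finding the context's trailing n-gram earlier in the context.

    Hypothetical helper for illustration only.
    """
    for n in range(max_n, min_n - 1, -1):  # try longer n-grams first
        if len(context) <= n:
            continue
        suffix = list(context[-n:])
        # Scan right-to-left so the most recent earlier occurrence wins.
        # Starting at len - n - 1 excludes the trailing n-gram itself.
        for start in range(len(context) - n - 1, -1, -1):
            if list(context[start:start + n]) == suffix:
                # The tokens that followed the earlier occurrence become the draft.
                return list(context[start + n:start + n + num_draft])
    return []  # no match: nothing to speculate on this step


# The trailing [1, 2, 3] also occurs at the start, so the tokens that
# followed it there are proposed as the draft.
draft = propose_ngram([1, 2, 3, 4, 5, 1, 2, 3])
```

An empty result simply means the step falls back to ordinary one-token decoding, which is why this method costs so little when it fails to match.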
Usage
Use this principle when deciding which speculation method to adopt for a given deployment scenario:
- Choose eagle or eagle3 when EAGLE checkpoints are available for the target model and GPU memory permits loading additional weights. EAGLE3 is preferred when available as it typically achieves higher acceptance rates.
- Choose ngram when no additional model weights are available, when GPU memory is constrained, or when the generation task involves producing text similar to the input (summarization, document editing, code completion).
- Choose mtp when the target model natively supports multi-token prediction heads (e.g., DeepSeek-V3).
- Choose draft_model when a smaller model from the same family is available (e.g., using a 1B parameter model to draft for an 8B parameter model).
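The guidance above can be written down as a small decision helper. This is a sketch under stated assumptions: `choose_speculative_config` is a hypothetical function, the priority order between methods is one possible reading of the list rather than a rule from vLLM, and the dictionary keys (`method`, `model`, `num_speculative_tokens`) are modeled on vLLM's `speculative_config` and should be checked against the installed version.

```python
from typing import Any, Dict, Optional


def choose_speculative_config(
    eagle3_checkpoint: Optional[str] = None,
    eagle_checkpoint: Optional[str] = None,
    has_native_mtp: bool = False,
    draft_model: Optional[str] = None,
    memory_constrained: bool = False,
    num_speculative_tokens: int = 3,
) -> Dict[str, Any]:
    """Pick a speculation method from what is available (hypothetical helper)."""
    base: Dict[str, Any] = {"num_speculative_tokens": num_speculative_tokens}
    if has_native_mtp:
        # Heads ship with the target model, so nothing extra to download.
        return {**base, "method": "mtp"}
    if not memory_constrained:
        if eagle3_checkpoint:  # preferred over EAGLE when both exist
            return {**base, "method": "eagle3", "model": eagle3_checkpoint}
        if eagle_checkpoint:
            return {**base, "method": "eagle", "model": eagle_checkpoint}
        if draft_model:
            return {**base, "method": "draft_model", "model": draft_model}
    # No extra weights available, or no memory to load them.
    return {**base, "method": "ngram"}
```

The returned dictionary is meant to be inspected and adapted before being passed to an engine. Whether the task resembles its input, which favors `ngram` even when other options exist, is left to the caller.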
Theoretical Basis
Speculative decoding is grounded in the observation that verifying K candidate tokens can be done in a single forward pass of the target model, whereas generating those K tokens autoregressively would require K separate forward passes. The key theoretical guarantee is that with proper rejection sampling, the output distribution is mathematically identical to the original model's distribution.
Draft-then-Verify Paradigm
The speculative decoding loop proceeds as follows:
- The draft mechanism proposes K candidate tokens: t_1, t_2, ..., t_K
- The target model processes all K tokens in a single forward pass, computing probabilities p(t_i | prefix, t_1, ..., t_{i-1}) for each position, where prefix is the sequence generated so far
- Each candidate token is accepted or rejected using a comparison between the draft probability q(t_i) and target probability p(t_i)
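- Upon first rejection at position j, a corrective token is sampled from an adjusted distribution (in the standard algorithm, the target distribution minus the draft distribution, clipped at zero and renormalized)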
- All tokens after position j are discarded
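The loop above can be sketched on toy distributions. This is a minimal illustration, not vLLM's sampler: `verify_step`, `sample`, and the `Dist` alias are hypothetical names, and distributions are plain dictionaries. Two details come from the standard speculative sampling algorithm rather than from this page: the corrective token is drawn from the normalized max(0, p - q), and when every draft token is accepted one extra "bonus" token is drawn from the target, so `target_dists` holds K + 1 entries.

```python
import random
from typing import Dict, List, Sequence

Dist = Dict[int, float]  # token id -> probability (toy representation)


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token id from a dictionary distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def verify_step(
    draft_tokens: Sequence[int],
    draft_dists: Sequence[Dist],   # q at each of the K draft positions
    target_dists: Sequence[Dist],  # p at each draft position, plus one bonus position
    rng: random.Random,
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step."""
    output: List[int] = []
    for i, token in enumerate(draft_tokens):
        p = target_dists[i].get(token, 0.0)
        q = draft_dists[i].get(token, 0.0)
        if q > 0 and rng.random() < min(1.0, p / q):
            output.append(token)  # accepted
            continue
        # First rejection: sample a corrective token from normalized max(0, p - q).
        residual = {
            t: pt - draft_dists[i].get(t, 0.0)
            for t, pt in target_dists[i].items()
            if pt - draft_dists[i].get(t, 0.0) > 0
        }
        output.append(sample(residual or target_dists[i], rng))
        return output  # everything after the rejected position is discarded
    # All K draft tokens accepted: draw the bonus token from the target.
    output.append(sample(target_dists[len(draft_tokens)], rng))
    return output
```

Every call emits at least one token, so a step never makes less progress than ordinary decoding; the cost is the wasted draft work when an early position is rejected.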
Acceptance Probability
For each candidate token t_i at position i, the acceptance probability is:
accept_prob = min(1, p(t_i) / q(t_i))
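where p(t_i) is the target model's probability and q(t_i) is the draft mechanism's probability for the same token, both conditioned on the same preceding tokens. Combined with resampling a corrective token on rejection, this ensures the final distribution exactly matches the target model's distribution. For greedy decoding, the rule simplifies to checking whether the draft token equals the target model's argmax.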
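The greedy special case is easy to show directly. The sketch below assumes the target model's argmax token is already known at each of the K draft positions plus one extra position; `greedy_verify` is a hypothetical name used only for illustration.

```python
from typing import List, Sequence


def greedy_verify(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Accept draft tokens while they match the target's argmax (greedy decoding).

    target_argmax must have len(draft_tokens) + 1 entries (assumption of this sketch).
    """
    output: List[int] = []
    for i, token in enumerate(draft_tokens):
        if token != target_argmax[i]:
            output.append(target_argmax[i])  # corrective token at the first mismatch
            return output
        output.append(token)
    # Every draft token matched, so the target's next token comes for free.
    output.append(target_argmax[len(draft_tokens)])
    return output


# The third draft token disagrees with the target, so it is replaced
# and nothing after it is kept.
tokens = greedy_verify([1, 2, 3], [1, 2, 9, 4])
```

The output is exactly what greedy decoding with the target model alone would have produced, only obtained in fewer target forward passes.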
Expected Speedup
The expected number of accepted tokens per verification step is:
E[accepted] = sum_{i=1}^{K} prod_{j=1}^{i} alpha_j
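where alpha_j is the acceptance rate at position j, conditioned on all earlier positions having been accepted. This count covers draft tokens only; each step also emits one token sampled from the target model itself, so a step never produces fewer tokens than ordinary decoding. The wall-clock speedup depends on the ratio of draft cost to verification cost and on the acceptance rate. Higher acceptance rates and cheaper draft mechanisms yield greater speedups.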
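The formula can be evaluated numerically with a running product. The sketch below also includes a rough speedup estimate under simplifying assumptions that are not from this page: verification costs one target forward pass regardless of K, each drafted token costs a fixed fraction `draft_cost` of a target forward pass, and every step emits one extra token from the target. Both function names are hypothetical.

```python
from typing import Sequence


def expected_accepted(alphas: Sequence[float]) -> float:
    """E[accepted] = sum over i of prod_{j<=i} alpha_j."""
    total = 0.0
    survive = 1.0  # probability that positions 1..i were all accepted
    for alpha in alphas:
        survive *= alpha
        total += survive
    return total


def estimated_speedup(alphas: Sequence[float], draft_cost: float) -> float:
    """Rough tokens-per-unit-time ratio versus plain decoding (simplified cost model)."""
    tokens_per_step = expected_accepted(alphas) + 1.0  # +1 token from the target itself
    cost_per_step = len(alphas) * draft_cost + 1.0     # K drafts + one verification pass
    return tokens_per_step / cost_per_step


# K = 3 draft tokens, each accepted 80% of the time given earlier acceptance,
# with a drafter that costs 10% of a target forward pass per token.
accepted = expected_accepted([0.8, 0.8, 0.8])
speedup = estimated_speedup([0.8, 0.8, 0.8], draft_cost=0.1)
```

Because the terms are cumulative products, later positions contribute less and less, which is why increasing K beyond a few tokens tends to add draft cost faster than it adds accepted tokens.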
Method-Specific Tradeoffs
| Method | Extra Model | Memory Overhead | Typical Acceptance Rate | Best For |
|---|---|---|---|---|
| EAGLE/EAGLE3 | Small head network | Low-Medium | High (70-85%) | General-purpose |
| N-gram | None | Zero | Variable (depends on task) | Summarization, code completion |
| MTP | None (built-in) | Low (heads ship with the target model) | High | Models with native MTP support |
| Draft Model | Full smaller model | High | Medium-High | When smaller model exists |