
Principle:Vllm project Vllm Speculative Method Selection

From Leeroopedia


Knowledge Sources
Domains: LLM Inference, Speculative Decoding, Performance Optimization
Last Updated: 2026-02-08 13:00 GMT

Overview

The choice of speculative decoding method determines how draft tokens are proposed before the target model verifies them during autoregressive generation. Each method makes a different trade-off among draft complexity, memory usage, and acceptance rate.

Description

Speculative decoding accelerates LLM inference by generating multiple candidate tokens cheaply (the "draft" phase) and then verifying them in a single forward pass of the target model (the "verify" phase). The choice of speculation method dictates how the draft tokens are produced:

  • EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency): Uses a lightweight head network trained on the target model's hidden states to predict future tokens. EAGLE leverages feature-level extrapolation rather than token-level prediction, achieving high acceptance rates with minimal overhead. The EAGLE head is a small neural network (typically a single transformer layer) that takes the target model's hidden states and predicts the next token distribution.
  • EAGLE3: An improved variant of EAGLE in which the target model exposes intermediate hidden states in addition to the final hidden state. The draft head consumes these extra features, enabling more accurate multi-step speculation. EAGLE3 typically achieves higher acceptance rates than EAGLE on the same models.
  • N-gram (Prompt Lookup): A model-free approach that proposes tokens by searching the existing prompt and generated context for a match to the most recent n-gram, then proposing the tokens that followed that match. When the model is generating text similar to the input (e.g., summarization, code completion with context), n-gram matching can achieve good acceptance rates with no additional model weights and negligible compute cost.
  • MTP (Multi-Token Prediction): Uses the target model's own multi-token prediction heads, which must be natively supported by the model architecture. Models like DeepSeek-V3 include MTP heads that were trained alongside the main model, providing high-quality draft predictions at no additional download cost.
  • Draft Model: Uses a separate, smaller language model as the draft proposer, usually from the same model family as the target so that both share a tokenizer and vocabulary. The draft model generates tokens autoregressively, and the target model verifies them in parallel. This is the classical speculative decoding approach.
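
The n-gram (prompt lookup) idea in the list above can be sketched in a few lines. This is a minimal illustration, not vLLM's implementation: the function name `propose_ngram` and its parameters `max_n`, `min_n`, and `k` are hypothetical, and it assumes the context is already a flat sequence of token ids.

```python
from typing import List, Sequence


def propose_ngram(tokens: Sequence[int], max_n: int = 3, min_n: int = 1,
                  k: int = 4) -> List[int]:
    """Propose up to k draft tokens by matching the trailing n-gram earlier in the context.

    Hypothetical helper for illustration only. Assumes min_n >= 1.
    """
    # Try the longest n-gram first; longer matches tend to be more reliable.
    for n in range(min(max_n, len(tokens) - 1), min_n - 1, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier occurrence wins.
        # The last admissible start leaves at least one token after the match.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Propose the tokens that followed the earlier occurrence.
                return list(tokens[start + n:start + n + k])
    return []  # no match: fall back to ordinary decoding for this step


context = [1, 2, 3, 4, 5, 1, 2, 3]
draft = propose_ngram(context)  # the trailing [1, 2, 3] also occurs at the start
```

When no earlier match exists the proposer returns an empty list, which is why the acceptance rate of this method depends so strongly on how much the output repeats the input.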

Usage

Use this principle when deciding which speculation method to adopt for a given deployment scenario:

  • Choose eagle or eagle3 when EAGLE checkpoints are available for the target model and GPU memory permits loading additional weights. EAGLE3 is preferred when available as it typically achieves higher acceptance rates.
  • Choose ngram when no additional model weights are available, when GPU memory is constrained, or when the generation task involves producing text similar to the input (summarization, document editing, code completion).
  • Choose mtp when the target model natively supports multi-token prediction heads (e.g., DeepSeek-V3).
  • Choose draft_model when a smaller model from the same family is available (e.g., using a 1B parameter model to draft for an 8B parameter model).
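
The guidance above can be written down as a small decision helper. This is a sketch under stated assumptions: the `Deployment` fields and the `choose_method` function are hypothetical, and the priority order among methods (MTP first, then EAGLE3, EAGLE, draft model, n-gram as fallback) is one plausible reading of the bullets, not an ordering the source prescribes. Only the returned method names come from the text.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a deployment; all field names are illustrative."""
    has_native_mtp: bool = False          # target model ships MTP heads
    has_eagle3_checkpoint: bool = False   # an EAGLE3 head exists for the target
    has_eagle_checkpoint: bool = False    # an EAGLE head exists for the target
    has_small_sibling_model: bool = False  # smaller model from the same family
    spare_gpu_memory: bool = True         # room to load additional weights


def choose_method(d: Deployment) -> str:
    """Return one of: 'mtp', 'eagle3', 'eagle', 'draft_model', 'ngram'."""
    if d.has_native_mtp:
        return "mtp"
    if d.spare_gpu_memory:
        if d.has_eagle3_checkpoint:
            return "eagle3"  # preferred over EAGLE when available
        if d.has_eagle_checkpoint:
            return "eagle"
        if d.has_small_sibling_model:
            return "draft_model"
    # No extra weights available, or memory is constrained.
    return "ngram"


method = choose_method(Deployment(has_eagle_checkpoint=True))
```

In practice the final choice should be confirmed by measuring acceptance rate and end-to-end latency on a representative workload.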

Theoretical Basis

Speculative decoding is grounded in the observation that verifying K candidate tokens can be done in a single forward pass of the target model, whereas generating those K tokens autoregressively would require K separate forward passes. The key theoretical guarantee is that with proper rejection sampling, the output distribution is mathematically identical to the original model's distribution.

Draft-then-Verify Paradigm

The speculative decoding loop proceeds as follows:

  1. The draft mechanism proposes K candidate tokens: t_1, t_2, ..., t_K
  2. The target model processes all K tokens in a single forward pass, computing the probability p(t_i | context, t_1, ..., t_{i-1}) at each position
  3. Each candidate token is accepted or rejected, in order, by comparing the draft probability q(t_i) with the target probability p(t_i)
  4. Upon the first rejection at position j, a corrective token is sampled from an adjusted distribution proportional to max(0, p - q) and takes the place of the rejected token
  5. All draft tokens after position j are discarded
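
The loop above can be sketched for a single verification step. This is an illustrative sketch, not vLLM code: `verify_step`, `residual`, and `sample` are hypothetical names, and distributions are plain dictionaries from token id to probability. One addition is assumed from the standard algorithm and is not among the steps listed: when no draft token is rejected, an extra token is sampled from the target distribution at the next position.

```python
import random
from typing import Dict, List, Sequence

Dist = Dict[int, float]  # token id -> probability


def residual(p: Dist, q: Dist) -> Dist:
    """Adjusted distribution used after a rejection: normalized max(0, p - q)."""
    raw = {t: max(0.0, prob - q.get(t, 0.0)) for t, prob in p.items()}
    total = sum(raw.values())
    # total == 0 only when p == q, in which case no rejection can occur.
    return {t: v / total for t, v in raw.items()} if total > 0 else dict(p)


def sample(dist: Dist, rng: random.Random) -> int:
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def verify_step(draft: Sequence[int], q_dists: Sequence[Dist],
                p_dists: Sequence[Dist], rng: random.Random) -> List[int]:
    """Accept/reject K draft tokens given per-position draft (q) and target (p) distributions.

    p_dists holds K + 1 entries: one per draft position plus one for the extra token.
    """
    out: List[int] = []
    for i, tok in enumerate(draft):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)); q[tok] > 0 since tok was drafted.
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
        else:
            # First rejection: emit a corrective token and drop the remaining drafts.
            out.append(sample(residual(p, q), rng))
            return out
    # Assumption: all drafts accepted, so sample one more token from the target.
    out.append(sample(p_dists[len(draft)], rng))
    return out


rng = random.Random(0)
agree = [{0: 1.0}] * 3
tokens = verify_step([0, 0], agree[:2], agree, rng)  # draft and target agree everywhere
```

Each call therefore emits between 1 and K + 1 tokens for one target forward pass, which is where the speedup comes from.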

Acceptance Probability

For each candidate token t_i at position i, the acceptance probability is:

accept_prob = min(1, p(t_i) / q(t_i))

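where p(t_i) is the target model's probability and q(t_i) is the draft mechanism's probability for the same token at position i. Combined with sampling the corrective token from the normalized residual max(0, p - q) on rejection, this rule makes the final distribution exactly match the target model's distribution. Under greedy decoding it reduces to checking whether the draft token equals the target model's argmax.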
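
The exactness claim can be checked numerically for a single position. The sketch below computes, in closed form, the distribution of the emitted token when a draft token is drawn from q, accepted with probability min(1, p/q), and otherwise replaced by a sample from the normalized residual max(0, p - q). The function name `output_distribution` and the example probabilities are illustrative assumptions.

```python
from typing import Dict

Dist = Dict[str, float]  # token -> probability


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the token emitted at one position by accept/reject sampling."""
    tokens = set(p) | set(q)
    # Mass from "draft proposes t and t is accepted": q(t) * min(1, p(t) / q(t)).
    accepted = {
        t: q[t] * min(1.0, p.get(t, 0.0) / q[t]) if q.get(t, 0.0) > 0 else 0.0
        for t in tokens
    }
    reject_mass = 1.0 - sum(accepted.values())
    # On rejection the corrective token comes from the normalized residual max(0, p - q).
    raw = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens}
    total = sum(raw.values())
    return {
        t: accepted[t] + (reject_mass * raw[t] / total if total > 0 else 0.0)
        for t in tokens
    }


p = {"a": 0.6, "b": 0.3, "c": 0.1}  # target distribution (made-up numbers)
q = {"a": 0.2, "b": 0.5, "c": 0.3}  # draft distribution (made-up numbers)
out = output_distribution(p, q)
# Up to floating-point error, out equals p even though q differs substantially from p.
```

Here token "a" is always accepted when drafted (mass 0.2) and also receives the entire rejected mass of 0.4, giving 0.6; "b" and "c" keep only their accepted mass of 0.3 and 0.1.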

Expected Speedup

The expected number of accepted tokens per verification step is:

E[accepted] = sum_{i=1}^{K} prod_{j=1}^{i} alpha_j

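where alpha_j is the acceptance rate at position j, conditional on all earlier positions having been accepted; the product reflects that token i only counts if tokens 1 through i are all accepted. The wall-clock speedup depends on the acceptance rate and on the ratio of draft cost to verification cost: higher acceptance rates and cheaper draft mechanisms yield greater speedups.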
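
The formula above is easy to evaluate directly. The sketch below also includes a rough speedup estimate under a simple cost model that is an assumption, not something stated in the source: one verification pass costs the same as one ordinary decoding step, each draft token costs `draft_cost_ratio` times that, and every step emits one extra token (the corrective or bonus token) on top of the accepted drafts. Both function names are hypothetical.

```python
from math import prod
from typing import Sequence


def expected_accepted(alphas: Sequence[float]) -> float:
    """E[accepted] = sum_{i=1}^{K} prod_{j=1}^{i} alpha_j."""
    return sum(prod(alphas[:i]) for i in range(1, len(alphas) + 1))


def estimated_speedup(alphas: Sequence[float], draft_cost_ratio: float) -> float:
    """Rough tokens-per-cost ratio versus plain decoding (assumed cost model)."""
    k = len(alphas)
    # Accepted drafts plus the one token the target contributes each step.
    tokens_per_step = expected_accepted(alphas) + 1.0
    # One target pass plus k draft proposals, in units of a target pass.
    cost_per_step = 1.0 + k * draft_cost_ratio
    return tokens_per_step / cost_per_step


alphas = [0.8, 0.8, 0.8]              # K = 3, constant acceptance rate (made-up)
accepted = expected_accepted(alphas)  # 0.8 + 0.64 + 0.512
speedup = estimated_speedup(alphas, draft_cost_ratio=0.1)
```

Because the per-position terms shrink geometrically, increasing K gives diminishing returns while the draft cost keeps growing linearly, so the best K depends on both the acceptance rate and the draft cost.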

Method-Specific Tradeoffs

Method | Extra Model | Memory Overhead | Typical Acceptance Rate | Best For
EAGLE/EAGLE3 | Small head network | Low-Medium | High (70-85%) | General-purpose
N-gram | None | Zero | Variable (depends on task) | Summarization, code completion
MTP | None (built-in) | Zero | High | Models with native MTP support
Draft Model | Full smaller model | High | Medium-High | When smaller model exists

Related Pages

Implemented By
