Workflow:Vllm project Vllm Speculative Decoding

From Leeroopedia
Revision as of 11:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Vllm_project_Vllm_Speculative_Decoding.md)


Knowledge Sources
Domains: LLMs, Inference, Speculative_Decoding, Performance_Optimization
Last Updated: 2026-02-08 13:00 GMT

Overview

End-to-end process for accelerating LLM inference using speculative decoding techniques (EAGLE, n-gram, MTP, draft model) with vLLM.

Description

This workflow covers configuring and running speculative decoding, a technique in which a fast draft model or heuristic proposes several candidate tokens that the target model then verifies in parallel. When candidates are accepted, a single target-model forward pass yields more than one token, which reduces per-request latency. vLLM supports several speculative decoding methods: EAGLE and EAGLE3 (learned draft heads), n-gram (prompt lookup), MTP (multi-token prediction), and traditional draft model approaches.

Usage

Execute this workflow when you need to reduce per-request latency for interactive applications. Speculative decoding is most effective for latency-sensitive use cases like chatbots, code completion, and real-time text generation where time-to-completion matters more than raw throughput. It is especially beneficial when the target model is large and the draft model or heuristic can predict tokens with reasonable accuracy.

Execution Steps

Step 1: Select Speculative Method

Choose a speculative decoding method based on the target model, available resources, and latency goals. Each method has different trade-offs in setup complexity, memory overhead, and acceptance rate.

Methods available:

  • EAGLE/EAGLE3: Learned draft head trained on the target model's hidden states. Highest acceptance rates, requires a compatible EAGLE checkpoint.
  • N-gram: Prompt lookup heuristic that reuses n-grams from the input prompt. Zero additional memory, no extra model needed, works well for repetitive text.
  • MTP (Multi-Token Prediction): Uses the model's own multi-token prediction heads. Requires a model trained with MTP support.
  • Draft Model: Uses a smaller model from the same family as a proposer. Flexible but requires loading a second model.
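To make the n-gram (prompt lookup) idea concrete, here is a minimal sketch of such a proposer in plain Python. It is an illustration of the heuristic described above, not vLLM's implementation; the function name ngram_propose and its parameters are hypothetical.

```python
from typing import List


def ngram_propose(tokens: List[int], max_n: int = 3, min_n: int = 1, k: int = 4) -> List[int]:
    """Propose up to k draft tokens by finding an earlier copy of the trailing n-gram."""
    for n in range(max_n, min_n - 1, -1):  # prefer the longest matching n-gram
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier occurrence wins.
        # Starting at len(tokens) - n - 1 excludes the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Reuse whatever followed the earlier occurrence as the draft.
                return tokens[start + n:start + n + k]
    return []  # no match: this step falls back to ordinary decoding


# The context ends with [1, 2, 3], which also appears at the start,
# so the tokens that followed it there become the proposal.
draft = ngram_propose([1, 2, 3, 4, 5, 1, 2, 3], k=2)
```

Because the proposal is copied from the existing context, no second model is loaded, which is why this method adds no model memory and suits repetitive text.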

Key considerations:

  • EAGLE provides the best acceptance rates but needs a separate checkpoint
  • N-gram is the easiest to set up (no extra model required)
  • Draft model is the most flexible but uses the most additional memory
  • The optimal num_speculative_tokens varies by method and workload

Step 2: Obtain Draft Model or Checkpoint

For EAGLE/EAGLE3 methods, download the corresponding EAGLE checkpoint from HuggingFace. For draft model method, select a smaller model from the same family. N-gram and MTP methods do not require additional model downloads.

Key considerations:

  • EAGLE checkpoints must match the target model architecture
  • Draft models should share the same tokenizer as the target model
  • Smaller draft models propose tokens faster but are accepted less often
  • Verify checkpoint compatibility before deployment
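One way to check the shared-tokenizer requirement is to compare the two vocabularies before deployment. The sketch below works on plain token-to-id dictionaries; the helper name vocab_mismatches is hypothetical, and the toy vocabularies stand in for real ones.

```python
from typing import Dict, List, Optional, Tuple


def vocab_mismatches(
    target_vocab: Dict[str, int],
    draft_vocab: Dict[str, int],
    limit: int = 5,
) -> List[Tuple[str, int, Optional[int]]]:
    """Return up to `limit` tokens whose id differs, or is missing, in the draft vocabulary."""
    diffs: List[Tuple[str, int, Optional[int]]] = []
    for token, target_id in target_vocab.items():
        draft_id = draft_vocab.get(token)  # None when the draft lacks the token
        if draft_id != target_id:
            diffs.append((token, target_id, draft_id))
            if len(diffs) >= limit:
                break
    return diffs


# Toy vocabularies for illustration only.
target = {"<s>": 0, "hello": 1, "world": 2}
draft_ok = {"<s>": 0, "hello": 1, "world": 2}
draft_bad = {"<s>": 0, "hello": 2, "world": 1}

compatible = not vocab_mismatches(target, draft_ok)
problems = vocab_mismatches(target, draft_bad)
```

With Hugging Face tokenizers, dictionaries of this shape can be obtained from AutoTokenizer.from_pretrained(...).get_vocab(). An empty result here is a necessary check rather than a full guarantee, since special tokens and chat templates can still differ.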

Step 3: Configure Speculative Decoding

Build the speculative_config dictionary with the chosen method, model path, and tuning parameters. The number of speculative tokens controls how many candidates are proposed per step.

Key considerations:

  • num_speculative_tokens typically ranges from 2 to 5
  • Higher values increase the potential speedup, but later draft positions are accepted less often, so the overall acceptance rate falls and more draft work is wasted
  • enable_chunked_prefill can improve throughput with speculative decoding
  • enforce_eager may be needed for debugging but reduces performance
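The configuration step can be sketched as a small helper that assembles and validates the dictionary. The helper build_speculative_config is hypothetical. The key names ("method", "model", "num_speculative_tokens", "prompt_lookup_max") and the method strings follow vLLM's documented examples as best understood here and may differ between vLLM versions, and the checkpoint path is a placeholder.

```python
from typing import Any, Dict, Optional

# Methods assumed to need a separate checkpoint or model path.
METHODS_NEEDING_MODEL = {"eagle", "eagle3", "draft_model"}


def build_speculative_config(
    method: str,
    num_speculative_tokens: int = 3,
    model: Optional[str] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Assemble a speculative_config dict and catch obvious mistakes early."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    if method in METHODS_NEEDING_MODEL and not model:
        raise ValueError(f"method '{method}' requires a model or checkpoint path")
    config: Dict[str, Any] = {
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
    }
    if model:
        config["model"] = model
    config.update(extra)  # method-specific tuning keys
    return config


# N-gram: no extra model, only lookup tuning.
ngram_cfg = build_speculative_config("ngram", 4, prompt_lookup_max=4)

# EAGLE: needs a checkpoint path (placeholder shown).
eagle_cfg = build_speculative_config("eagle", 3, model="path/to/eagle-checkpoint")
```

Starting with a value near the low end of the typical range and raising it only if the measured acceptance stays high keeps wasted draft work small.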

Step 4: Initialize Engine with Speculation

Create the LLM instance with the speculative_config parameter. The engine loads both the target model and the draft model/head, setting up the verification pipeline.

Key considerations:

  • GPU memory must accommodate both target and draft models
  • tensor_parallel_size applies to the target model
  • The draft model runs on the same GPU(s) as the target
  • gpu_memory_utilization may need adjustment for the additional memory
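A minimal sketch of the initialization step follows. The helper names are hypothetical, the model path is a placeholder, and the chosen gpu_memory_utilization value is only an example; the keyword arguments passed to LLM are the ones named in this step. The vLLM import is deferred so that the argument-building part can be inspected on a machine without vLLM or a GPU.

```python
from typing import Any, Dict


def build_engine_kwargs(
    target_model: str,
    speculative_config: Dict[str, Any],
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.9,
) -> Dict[str, Any]:
    """Collect the LLM constructor arguments used for speculative decoding."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return {
        "model": target_model,
        "speculative_config": speculative_config,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def create_llm(**engine_kwargs: Any):
    """Instantiate the engine; requires vLLM and a suitable GPU."""
    from vllm import LLM  # deferred import, only needed when actually loading

    return LLM(**engine_kwargs)


kwargs = build_engine_kwargs(
    "path/to/target-model",  # placeholder
    {"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 4},
    gpu_memory_utilization=0.85,
)
# llm = create_llm(**kwargs)  # uncomment on a machine with vLLM installed
```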

Step 5: Run Speculative Generation

Submit prompts for generation. The engine transparently handles the draft-then-verify loop: the draft model proposes candidate tokens, the target model verifies them in a single forward pass, and accepted tokens are committed to the output.

Key considerations:

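  • Speculative decoding is designed to be lossless: with greedy decoding the accepted tokens match what the target model alone would produce, and with sampling the output follows the target model's distribution, up to floating-point differences between runs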
  • Temperature 0 (greedy) typically yields higher acceptance rates
  • Batch size affects the effectiveness of speculation
  • Metrics are available to track acceptance rates per position
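The draft-then-verify loop can be illustrated with a toy greedy version in plain Python. This is a conceptual sketch, not vLLM's engine code: the "models" are hypothetical functions from a token list to the next token, and verification is done position by position for clarity, whereas a real engine scores all draft positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 3) -> List[int]:
    """Greedy speculative decoding: accept the longest draft prefix the target agrees with."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft phase: propose k tokens autoregressively with the cheap model.
        ctx, proposal = list(out), []
        for _ in range(k):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # Verify phase: compare each proposed token with the target's choice.
        accepted: List[int] = []
        for token in proposal:
            expected = target(out + accepted)
            if token != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(token)
        else:
            accepted.append(target(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else toy_target(ctx)


same = speculative_decode(toy_target, toy_draft, [1], 10) == greedy_decode(toy_target, [1], 10)
```

Every committed token is one the target would have chosen for that prefix, so the greedy outputs match by construction; only the number of verification steps changes.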

Step 6: Evaluate Acceptance Metrics

Analyze speculative decoding metrics to assess effectiveness. Key metrics include mean acceptance length, acceptance rate per token position, and the number of draft/accepted tokens.

Key considerations:

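  • Mean acceptance length > 1 means each target forward pass yields more than one token on average; an actual speedup also requires that this gain outweigh the cost of drafting and verification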
  • Per-position acceptance rates reveal how many speculative tokens are optimal
  • Metrics are available via llm.get_metrics() or the Prometheus endpoint
  • Adjust num_speculative_tokens based on observed acceptance patterns
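The metrics named in this step can be derived from a few raw counts, as in the sketch below. The helper summarize_spec_metrics is hypothetical. It assumes the counts are read from counters such as vllm:spec_decode_num_drafts, vllm:spec_decode_num_draft_tokens, vllm:spec_decode_num_accepted_tokens, and vllm:spec_decode_num_accepted_tokens_per_pos; these names are an assumption and should be checked against the installed vLLM version. It also assumes the convention that each verification step always yields one token from the target in addition to any accepted draft tokens.

```python
from typing import Dict, List, Union

Summary = Dict[str, Union[float, List[float]]]


def summarize_spec_metrics(
    num_drafts: int,
    num_draft_tokens: int,
    num_accepted_tokens: int,
    accepted_per_pos: List[int],
) -> Summary:
    """Turn raw speculative decoding counts into the metrics discussed above."""
    if num_drafts == 0 or num_draft_tokens == 0:
        # No speculation happened: one token per step, nothing accepted.
        return {"mean_acceptance_length": 1.0, "acceptance_rate": 0.0, "per_position_rate": []}
    return {
        # One target token per step plus the average number of accepted draft tokens.
        "mean_acceptance_length": 1.0 + num_accepted_tokens / num_drafts,
        # Share of all proposed tokens that were accepted.
        "acceptance_rate": num_accepted_tokens / num_draft_tokens,
        # How often the draft token at each position was accepted.
        "per_position_rate": [count / num_drafts for count in accepted_per_pos],
    }


# Illustrative counts only, not measured results.
summary = summarize_spec_metrics(
    num_drafts=100,
    num_draft_tokens=300,
    num_accepted_tokens=150,
    accepted_per_pos=[80, 50, 20],
)
```

A per-position rate that drops sharply after some position suggests that draft tokens beyond it are mostly wasted, which is a signal to lower num_speculative_tokens.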
