Implementation:Vllm project Vllm Speculative Config Dict

From Leeroopedia


Knowledge Sources
Domains: LLM Inference, Speculative Decoding, Configuration
Last Updated: 2026-02-08 13:00 GMT

Overview

Reference for the speculative decoding configuration dictionary (speculative_config) accepted by vLLM.

Description

The speculative_config is a plain Python dictionary that fully describes the speculative decoding setup. It is passed to the LLM constructor and internally converted to a SpeculativeConfig dataclass via EngineArgs.create_speculative_config(). The dictionary keys map directly to the fields of SpeculativeConfig defined in vllm/config/speculative.py. Required keys vary by method, but method and num_speculative_tokens are always needed.
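The key-to-field mapping described above can be illustrated with a stand-in dataclass. This is a minimal sketch under stated assumptions: SpeculativeConfigSketch and to_sketch are hypothetical names, the class carries only a few of the fields listed on this page, and it is not vLLM's SpeculativeConfig. It only shows why a dictionary key that does not match a field name is rejected when the dictionary is unpacked into a dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfigSketch:
    """Hypothetical stand-in for SpeculativeConfig (subset of fields only)."""
    method: str
    num_speculative_tokens: int
    model: Optional[str] = None
    prompt_lookup_max: Optional[int] = None
    prompt_lookup_min: Optional[int] = None


def to_sketch(config: dict) -> SpeculativeConfigSketch:
    # Keyword unpacking: every dict key must match a field name exactly.
    return SpeculativeConfigSketch(**config)


# A well-formed dict maps cleanly onto the fields.
ngram = to_sketch(
    {"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5}
)

# A misspelled key ("lookup_max") has no matching field and raises TypeError.
try:
    to_sketch({"method": "ngram", "num_speculative_tokens": 3, "lookup_max": 5})
    typo_rejected = False
except TypeError:
    typo_rejected = True
```

The real conversion also resolves the draft model configuration, which this sketch does not attempt.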

Usage

Use this configuration dictionary whenever enabling speculative decoding with vLLM. Construct the dictionary with method-appropriate keys before passing it to the LLM constructor.
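One way to keep the keys method-appropriate is to assemble the dictionary through a small helper. The sketch below is illustrative only: make_speculative_config and REQUIRED_KEYS are hypothetical names, and the per-method requirements are copied from the schema on this page rather than from vLLM's own validation logic.

```python
# Per-method required keys, as listed in the schema on this page.
REQUIRED_KEYS = {
    "eagle": ("model",),
    "eagle3": ("model",),
    "draft_model": ("model",),
    "ngram": ("prompt_lookup_max",),
    "mtp": (),
}


def make_speculative_config(method: str, num_speculative_tokens: int, **extra) -> dict:
    """Hypothetical helper: build the dict and check method-specific keys."""
    if method not in REQUIRED_KEYS:
        raise ValueError(f"unknown method: {method!r}")
    config = {
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
        **extra,
    }
    # A key counts as missing if it is absent or explicitly set to None.
    missing = [key for key in REQUIRED_KEYS[method] if config.get(key) is None]
    if missing:
        raise ValueError(f"method {method!r} requires: {', '.join(missing)}")
    return config


ngram_config = make_speculative_config("ngram", 3, prompt_lookup_max=5)
```

The returned dictionary can then be passed as the speculative_config argument of the LLM constructor.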

Code Reference

Source Location

  • Repository: vllm
  • Files:
      • vllm/engine/arg_utils.py:L512 (speculative_config parameter definition)
      • vllm/engine/arg_utils.py:L1353-1378 (create_speculative_config method)
      • vllm/config/speculative.py (SpeculativeConfig dataclass)

Signature

# The speculative_config dict schema (passed to LLM constructor)
speculative_config = {
    "method": str,                    # Required: "eagle", "eagle3", "ngram", "mtp", "draft_model"
    "model": str | None,              # Required for eagle/eagle3/draft_model; path or Hub ID
    "num_speculative_tokens": int,    # Required: number of draft tokens per round (typically 2-5)
    "prompt_lookup_max": int | None,  # Required for ngram: max n-gram window size
    "prompt_lookup_min": int | None,  # Optional for ngram: min n-gram window size (default 1)
    # Optional advanced parameters:
    "enforce_eager": bool | None,              # Override eager mode for draft model
    "max_model_len": int | None,               # Max sequence length for draft model
    "draft_tensor_parallel_size": int | None,  # TP degree for draft model
    "quantization": str | None,                # Quantization for draft model weights
    "revision": str | None,                    # Model revision for draft model
    "disable_padded_drafter_batch": bool,      # Disable input padding (EAGLE only)
    "parallel_drafting": bool,                 # Enable parallel drafting (EAGLE/draft_model)
}

# Internal conversion in EngineArgs
def create_speculative_config(
    self,
    target_model_config: ModelConfig,
    target_parallel_config: ParallelConfig,
) -> SpeculativeConfig | None:
    ...

Import

# No special import needed for the dict itself.
# The dict is passed to the LLM constructor:
from vllm import LLM

llm = LLM(model="...", speculative_config={...})

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| method | str | Yes | Speculative decoding strategy: "eagle", "eagle3", "ngram", "mtp", or "draft_model". |
| model | str or None | Conditional | Path or Hugging Face Hub ID for the EAGLE head or draft model. Required for eagle, eagle3, and draft_model methods. |
| num_speculative_tokens | int | Yes | Number of candidate tokens to draft per speculation round. Must be a positive integer. Typical values range from 2 to 5. |
| prompt_lookup_max | int or None | Conditional | Maximum n-gram window size. Required when method="ngram". |
| prompt_lookup_min | int or None | No | Minimum n-gram window size. Defaults to 1. Only applicable when method="ngram". |
| enforce_eager | bool or None | No | Override eager execution mode for the draft model. When None, inherits the target model setting. |
| max_model_len | int or None | No | Maximum sequence length for the draft model. Useful for constraining draft model memory usage. |
| draft_tensor_parallel_size | int or None | No | Tensor parallel degree for the draft model. Must be 1 or equal to target model's TP size. |
| disable_padded_drafter_batch | bool | No | Disable input padding for speculative batches. Defaults to False. Only affects EAGLE method. |
| parallel_drafting | bool | No | Generate all speculative tokens in parallel rather than sequentially. Defaults to False. Compatible with EAGLE and draft model methods. |
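Two of the value constraints in the table above can be checked before the engine is constructed. The following is a rough sketch, not vLLM's validation code: check_speculative_values and its target_tp_size parameter are hypothetical, and only the constraints stated in the table (a positive num_speculative_tokens, and a draft TP size of 1 or the target's TP size) are covered.

```python
def check_speculative_values(config: dict, target_tp_size: int = 1) -> list:
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    problems = []

    # num_speculative_tokens must be a positive integer (bool is excluded
    # explicitly because bool is a subclass of int in Python).
    n = config.get("num_speculative_tokens")
    if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
        problems.append("num_speculative_tokens must be a positive integer")

    # draft_tensor_parallel_size must be 1 or match the target model's TP size.
    draft_tp = config.get("draft_tensor_parallel_size")
    if draft_tp is not None and draft_tp not in (1, target_tp_size):
        problems.append(
            f"draft_tensor_parallel_size must be 1 or {target_tp_size}, got {draft_tp}"
        )

    return problems


ok = check_speculative_values(
    {"method": "eagle", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1},
    target_tp_size=4,
)
bad = check_speculative_values(
    {"method": "eagle", "num_speculative_tokens": 0, "draft_tensor_parallel_size": 2},
    target_tp_size=4,
)
```

Returning a list rather than raising on the first failure lets a caller report every problem at once; that is a design choice of this sketch, not a property of vLLM.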

Outputs

| Name | Type | Description |
|---|---|---|
| SpeculativeConfig | SpeculativeConfig | A validated configuration object created internally by EngineArgs.create_speculative_config(). Contains all speculative decoding parameters plus resolved draft model configuration. |

Usage Examples

EAGLE Configuration

speculative_config = {
    "method": "eagle",
    "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
}

EAGLE3 Configuration with Parallel Drafting

speculative_config = {
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
    "disable_padded_drafter_batch": False,
    "parallel_drafting": True,
}

N-gram Configuration

speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}
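To make the roles of prompt_lookup_max, prompt_lookup_min, and num_speculative_tokens concrete, here is a toy prompt-lookup proposer. It is a simplified illustration and not vLLM's implementation: propose_ngram_draft is a hypothetical function, and the choices to try the longest window first and to take the leftmost earlier match are assumptions of this sketch.

```python
def propose_ngram_draft(token_ids, num_speculative_tokens,
                        prompt_lookup_max, prompt_lookup_min=1):
    """Toy n-gram proposer: reuse tokens that followed an earlier match."""
    # Try the longest suffix window first, then progressively shorter ones.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if n >= len(token_ids):
            continue  # window would cover the whole sequence
        suffix = token_ids[-n:]
        # Scan earlier positions for the same n-gram; the range stops
        # before the position of the suffix itself.
        for start in range(len(token_ids) - n):
            if token_ids[start:start + n] == suffix:
                # Propose the tokens that followed the earlier occurrence.
                return token_ids[start + n:start + n + num_speculative_tokens]
    return []  # no window size produced a match


# The last three tokens [2, 3, 4] also occur at positions 1..3,
# where they were followed by 5, 9, 2.
draft = propose_ngram_draft([1, 2, 3, 4, 5, 9, 2, 3, 4], 3, 5, 2)
```

With the settings from the example above (window sizes 5 down to 2, three draft tokens), the windows of size 5 and 4 find no earlier match, and the window of size 3 yields the draft.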

MTP Configuration

# For models with native MTP support (e.g., DeepSeek-V3)
speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 2,
}

Draft Model Configuration

speculative_config = {
    "method": "draft_model",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
    "enforce_eager": True,
    "max_model_len": 16384,
    "parallel_drafting": False,
}
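When the engine is started from a command line rather than through the LLM constructor, the same dictionary can be serialized to JSON. The sketch below assumes the server CLI accepts the dictionary as a JSON string through a --speculative-config flag; verify the flag against the installed vLLM version. The target model is left as a placeholder because this page does not name one.

```python
import json
import shlex

speculative_config = {
    "method": "draft_model",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
}

# Serialize the dict and quote it so the shell passes it as one argument.
config_json = json.dumps(speculative_config)

# <target-model> is a placeholder, not a real model ID; the flag name is
# an assumption (see the note above).
command = "vllm serve <target-model> --speculative-config " + shlex.quote(config_json)
print(command)
```

Using json.dumps avoids hand-written JSON mistakes such as Python-style True/False or single-quoted strings.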

Related Pages

Implements Principle
