Implementation:Vllm project Vllm Speculative Config Dict

Knowledge Sources	vLLM vLLM Docs
Domains	LLM Inference, Speculative Decoding, Configuration
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete reference for the speculative decoding configuration dictionary accepted by vLLM.

Description

The speculative_config is a plain Python dictionary that describes the speculative decoding setup. It is passed to the LLM constructor and converted internally to a SpeculativeConfig dataclass by EngineArgs.create_speculative_config(). The dictionary keys map directly to the fields of SpeculativeConfig, defined in vllm/config/speculative.py. Required keys vary by method: method and num_speculative_tokens are always needed, model is needed for eagle, eagle3, and draft_model, and prompt_lookup_max is needed for ngram.
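The dict-to-dataclass step described above can be sketched without vLLM installed. This is a minimal illustration, not the library's code: SpeculativeConfigSketch and to_config_sketch are hypothetical names, the stand-in dataclass carries only a few of the fields listed on this page, and the real SpeculativeConfig performs additional validation and draft model resolution.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SpeculativeConfigSketch:
    """Hypothetical stand-in for vLLM's SpeculativeConfig (subset of fields)."""
    method: Optional[str] = None
    model: Optional[str] = None
    num_speculative_tokens: Optional[int] = None
    prompt_lookup_max: Optional[int] = None
    prompt_lookup_min: Optional[int] = None


def to_config_sketch(speculative_config: Optional[dict]) -> Optional[SpeculativeConfigSketch]:
    """Mimic the conversion: None stays None, dict keys become dataclass fields."""
    if speculative_config is None:
        return None
    known = {f.name for f in fields(SpeculativeConfigSketch)}
    unknown = set(speculative_config) - known
    if unknown:
        # Assumption: reject keys that do not correspond to a field.
        raise ValueError(f"Unknown speculative_config keys: {sorted(unknown)}")
    return SpeculativeConfigSketch(**speculative_config)


cfg = to_config_sketch(
    {"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5}
)
```

Because keys are unpacked directly into fields, a misspelled key surfaces as an error at construction time rather than being silently ignored in this sketch.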

Usage

Use this configuration dictionary whenever enabling speculative decoding with vLLM. Construct the dictionary with method-appropriate keys before passing it to the LLM constructor.
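One way to construct the dictionary with only method-appropriate keys is a small builder that omits anything left unset. The helper name make_speculative_config is hypothetical and not part of vLLM; it is a sketch of the pattern rather than a recommended API.

```python
from typing import Any, Dict, Optional


def make_speculative_config(
    method: str,
    num_speculative_tokens: int,
    model: Optional[str] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a speculative_config dict, dropping unset keys."""
    config: Dict[str, Any] = {
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
    }
    if model is not None:
        config["model"] = model
    # Keep only the optional keys that were given a value.
    config.update({k: v for k, v in extra.items() if v is not None})
    return config


ngram_config = make_speculative_config(
    "ngram", 3, prompt_lookup_max=5, prompt_lookup_min=None
)
eagle_config = make_speculative_config(
    "eagle", 3, model="yuhuili/EAGLE-LLaMA3.1-Instruct-8B"
)
```

The resulting dictionary would then be handed to the constructor, for example LLM(model="...", speculative_config=eagle_config).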

Code Reference

Source Location

Repository: vllm
File: vllm/engine/arg_utils.py:L512 (speculative_config parameter definition)
File: vllm/engine/arg_utils.py:L1353-1378 (create_speculative_config method)
File: vllm/config/speculative.py (SpeculativeConfig dataclass)

Signature

# The speculative_config dict schema (passed to LLM constructor)
speculative_config = {
    "method": str,                    # Required: "eagle", "eagle3", "ngram", "mtp", "draft_model"
    "model": str | None,              # Required for eagle/eagle3/draft_model; path or Hub ID
    "num_speculative_tokens": int,    # Required: number of draft tokens per round (typically 2-5)
    "prompt_lookup_max": int | None,  # Required for ngram: max n-gram window size
    "prompt_lookup_min": int | None,  # Optional for ngram: min n-gram window size (default 1)
    # Optional advanced parameters:
    "enforce_eager": bool | None,              # Override eager mode for draft model
    "max_model_len": int | None,               # Max sequence length for draft model
    "draft_tensor_parallel_size": int | None,  # TP degree for draft model
    "quantization": str | None,                # Quantization for draft model weights
    "revision": str | None,                    # Model revision for draft model
    "disable_padded_drafter_batch": bool,      # Disable input padding (EAGLE only)
    "parallel_drafting": bool,                 # Enable parallel drafting (EAGLE/draft_model)
}

# Internal conversion in EngineArgs
def create_speculative_config(
    self,
    target_model_config: ModelConfig,
    target_parallel_config: ParallelConfig,
) -> SpeculativeConfig | None:
    ...

Import

# No special import needed for the dict itself.
# The dict is passed to the LLM constructor:
from vllm import LLM

llm = LLM(model="...", speculative_config={...})

I/O Contract

Inputs

Name	Type	Required	Description
method	`str`	Yes	Speculative decoding strategy: `"eagle"`, `"eagle3"`, `"ngram"`, `"mtp"`, or `"draft_model"`.
model	`str or None`	Conditional	Path or Hugging Face Hub ID for the EAGLE head or draft model. Required for `eagle`, `eagle3`, and `draft_model` methods.
num_speculative_tokens	`int`	Yes	Number of candidate tokens to draft per speculation round. Must be a positive integer. Typical values range from 2 to 5.
prompt_lookup_max	`int or None`	Conditional	Maximum n-gram window size. Required when `method="ngram"`.
prompt_lookup_min	`int or None`	No	Minimum n-gram window size. Defaults to 1. Only applicable when `method="ngram"`.
enforce_eager	`bool or None`	No	Override eager execution mode for the draft model. When `None`, inherits the target model setting.
max_model_len	`int or None`	No	Maximum sequence length for the draft model. Useful for constraining draft model memory usage.
draft_tensor_parallel_size	`int or None`	No	Tensor parallel degree for the draft model. Must be 1 or equal to target model's TP size.
quantization	`str or None`	No	Quantization method for the draft model weights.
revision	`str or None`	No	Model revision to use for the draft model.
disable_padded_drafter_batch	`bool`	No	Disable input padding for speculative batches. Defaults to `False`. Only affects EAGLE method.
parallel_drafting	`bool`	No	Generate all speculative tokens in parallel rather than sequentially. Defaults to `False`. Compatible with EAGLE and draft model methods.
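The required and conditional rules in the table above can be checked before the dictionary reaches vLLM. The following is a sketch that encodes only the constraints stated on this page; check_speculative_config and its target_tp_size parameter are hypothetical, and vLLM's own validation may be stricter or differ in detail.

```python
from typing import Any, Dict, List

KNOWN_METHODS = {"eagle", "eagle3", "ngram", "mtp", "draft_model"}
METHODS_NEEDING_MODEL = {"eagle", "eagle3", "draft_model"}


def check_speculative_config(config: Dict[str, Any], target_tp_size: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the dict looks consistent."""
    problems: List[str] = []

    method = config.get("method")
    if method not in KNOWN_METHODS:
        problems.append(f"method must be one of {sorted(KNOWN_METHODS)}, got {method!r}")

    n = config.get("num_speculative_tokens")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
        problems.append("num_speculative_tokens must be a positive integer")

    if method in METHODS_NEEDING_MODEL and not config.get("model"):
        problems.append(f"model is required when method={method!r}")

    if method == "ngram" and config.get("prompt_lookup_max") is None:
        problems.append("prompt_lookup_max is required when method='ngram'")

    draft_tp = config.get("draft_tensor_parallel_size")
    if draft_tp is not None and draft_tp not in (1, target_tp_size):
        problems.append("draft_tensor_parallel_size must be 1 or equal to the target TP size")

    return problems


ok = check_speculative_config(
    {"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5}
)
missing_model = check_speculative_config({"method": "eagle", "num_speculative_tokens": 3})
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in one pass.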

Outputs

Name	Type	Description
SpeculativeConfig	`SpeculativeConfig`	A validated configuration object created internally by `EngineArgs.create_speculative_config()`. Contains all speculative decoding parameters plus resolved draft model configuration.

Usage Examples

EAGLE Configuration

speculative_config = {
    "method": "eagle",
    "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
}

EAGLE3 Configuration with Parallel Drafting

speculative_config = {
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
    "disable_padded_drafter_batch": False,
    "parallel_drafting": True,
}

N-gram Configuration

speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}
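To make the roles of prompt_lookup_max and prompt_lookup_min concrete, here is a toy prompt-lookup drafter. It is an illustrative sketch, not vLLM's n-gram proposer: the function name, the longest-window-first search, and the most-recent-match tie-break are all assumptions made for this example.

```python
from typing import List


def propose_ngram_draft(
    tokens: List[int],
    prompt_lookup_min: int,
    prompt_lookup_max: int,
    num_speculative_tokens: int,
) -> List[int]:
    """Toy drafter: find the trailing n-gram earlier in the sequence and copy what followed."""
    # Try the largest window first, then shrink down to the minimum size.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if n <= 0 or n >= len(tokens):
            continue
        suffix = tokens[-n:]
        # Scan earlier start positions, most recent first (an arbitrary choice here).
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + num_speculative_tokens]
                if continuation:
                    return continuation
    return []  # No match: nothing to draft this round.


draft = propose_ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], 2, 5, 3)
# The trailing 3-gram [1, 2, 3] also occurs at the start, so draft == [4, 5, 1].
```

A larger prompt_lookup_max favors longer, more specific matches, while prompt_lookup_min sets how short a match the drafter is still willing to accept.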

MTP Configuration

# For models with native MTP support (e.g., DeepSeek-V3)
speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 2,
}

Draft Model Configuration

speculative_config = {
    "method": "draft_model",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
    "enforce_eager": True,
    "max_model_len": 16384,
    "parallel_drafting": False,
}

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Speculative_Decoding_Configuration
