Implementation:Vllm project Vllm Speculative Config Dict
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Speculative Decoding, Configuration |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete reference for the speculative decoding configuration dictionary accepted by vLLM.
Description
The speculative_config is a plain Python dictionary that fully describes the speculative decoding setup. It is passed to the LLM constructor and internally converted to a SpeculativeConfig dataclass via EngineArgs.create_speculative_config(). The dictionary keys map directly to the fields of SpeculativeConfig defined in vllm/config/speculative.py. Required keys vary by method, but method and num_speculative_tokens are always needed.
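The key-to-field mapping described above can be pictured with a small stand-in. This is a minimal sketch, not vLLM's code: `ToySpeculativeConfig` and `to_toy_config` are hypothetical names, and only a handful of the real fields are modelled.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToySpeculativeConfig:
    """Hypothetical stand-in for SpeculativeConfig; only a few fields are modelled."""
    method: str
    num_speculative_tokens: int
    model: Optional[str] = None
    prompt_lookup_max: Optional[int] = None
    prompt_lookup_min: Optional[int] = None


def to_toy_config(cfg: dict) -> ToySpeculativeConfig:
    """Expand dict keys into dataclass fields, rejecting keys with no matching field."""
    known = {f.name for f in fields(ToySpeculativeConfig)}
    unknown = set(cfg) - known
    if unknown:
        raise ValueError(f"unknown speculative_config keys: {sorted(unknown)}")
    return ToySpeculativeConfig(**cfg)


toy = to_toy_config(
    {"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5}
)
```

Because every key must name a field, a misspelled key fails early instead of being silently ignored. Whether the real conversion reports the error in the same way is not stated here.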
Usage
Use this configuration dictionary whenever enabling speculative decoding with vLLM. Construct the dictionary with method-appropriate keys before passing it to the LLM constructor.
Code Reference
Source Location
- Repository: vllm
- File:
  - vllm/engine/arg_utils.py:L512 (speculative_config parameter definition)
  - vllm/engine/arg_utils.py:L1353-1378 (create_speculative_config method)
  - vllm/config/speculative.py (SpeculativeConfig dataclass)
Signature
# The speculative_config dict schema (passed to LLM constructor)
speculative_config = {
    "method": str,                             # Required: "eagle", "eagle3", "ngram", "mtp", "draft_model"
    "model": str | None,                       # Required for eagle/eagle3/draft_model; path or Hub ID
    "num_speculative_tokens": int,             # Required: number of draft tokens per round (typically 2-5)
    "prompt_lookup_max": int | None,           # Required for ngram: max n-gram window size
    "prompt_lookup_min": int | None,           # Optional for ngram: min n-gram window size (default 1)
    # Optional advanced parameters:
    "enforce_eager": bool | None,              # Override eager mode for draft model
    "max_model_len": int | None,               # Max sequence length for draft model
    "draft_tensor_parallel_size": int | None,  # TP degree for draft model
    "quantization": str | None,                # Quantization for draft model weights
    "revision": str | None,                    # Model revision for draft model
    "disable_padded_drafter_batch": bool,      # Disable input padding (EAGLE only)
    "parallel_drafting": bool,                 # Enable parallel drafting (EAGLE/draft_model)
}
# Internal conversion in EngineArgs
def create_speculative_config(
    self,
    target_model_config: ModelConfig,
    target_parallel_config: ParallelConfig,
) -> SpeculativeConfig | None:
    ...
Import
# No special import needed for the dict itself.
# The dict is passed to the LLM constructor:
from vllm import LLM
llm = LLM(model="...", speculative_config={...})
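For a fuller picture of the call above, the following sketch wraps the constructor and one generation call in a function. The function name `generate_with_speculation`, the sampling settings, and the choice of the n-gram method are illustrative assumptions. The vLLM import is deferred so the module can be loaded on a machine without vLLM or a GPU.

```python
# Illustrative n-gram settings; the values are assumptions, not recommendations.
SPEC_CONFIG = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 5,
}


def generate_with_speculation(target_model: str, prompts: list[str]) -> list[str]:
    """Load target_model with SPEC_CONFIG and return one completion per prompt."""
    # Imported lazily so this module can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=target_model, speculative_config=SPEC_CONFIG)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput holds one or more completions; take the first.
    return [out.outputs[0].text for out in outputs]
```

The n-gram method is used here only because it needs no second model download. Any of the dictionaries in the Usage Examples section could be substituted.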
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| method | str | Yes | Speculative decoding strategy: "eagle", "eagle3", "ngram", "mtp", or "draft_model". |
| model | str or None | Conditional | Path or Hugging Face Hub ID for the EAGLE head or draft model. Required for eagle, eagle3, and draft_model methods. |
| num_speculative_tokens | int | Yes | Number of candidate tokens to draft per speculation round. Must be a positive integer. Typical values range from 2 to 5. |
| prompt_lookup_max | int or None | Conditional | Maximum n-gram window size. Required when method="ngram". |
| prompt_lookup_min | int or None | No | Minimum n-gram window size. Defaults to 1. Only applicable when method="ngram". |
| enforce_eager | bool or None | No | Override eager execution mode for the draft model. When None, inherits the target model setting. |
| max_model_len | int or None | No | Maximum sequence length for the draft model. Useful for constraining draft model memory usage. |
| draft_tensor_parallel_size | int or None | No | Tensor parallel degree for the draft model. Must be 1 or equal to target model's TP size. |
| disable_padded_drafter_batch | bool | No | Disable input padding for speculative batches. Defaults to False. Only affects EAGLE method. |
| parallel_drafting | bool | No | Generate all speculative tokens in parallel rather than sequentially. Defaults to False. Compatible with EAGLE and draft model methods. |
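The constraints in the table above can be checked before the dictionary reaches vLLM. The helper below is a minimal sketch of such a pre-flight check. `check_speculative_config` is a hypothetical name, the rules are transcribed from this page rather than from vLLM's own validation, and `KNOWN_METHODS` covers only the five methods listed here.

```python
# Methods that need a "model" entry, per the table above.
MODEL_REQUIRED = {"eagle", "eagle3", "draft_model"}
# Only the methods this page lists; vLLM itself may accept others.
KNOWN_METHODS = MODEL_REQUIRED | {"ngram", "mtp"}


def check_speculative_config(cfg: dict, target_tp_size: int = 1) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    method = cfg.get("method")
    if method not in KNOWN_METHODS:
        problems.append(f"unrecognised method: {method!r}")

    n = cfg.get("num_speculative_tokens")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
        problems.append("num_speculative_tokens must be a positive integer")

    if method in MODEL_REQUIRED and not cfg.get("model"):
        problems.append(f"model is required for method={method!r}")

    if method == "ngram" and cfg.get("prompt_lookup_max") is None:
        problems.append("prompt_lookup_max is required for method='ngram'")

    draft_tp = cfg.get("draft_tensor_parallel_size")
    if draft_tp is not None and draft_tp not in (1, target_tp_size):
        problems.append("draft_tensor_parallel_size must be 1 or the target TP size")

    return problems
```

Returning a list instead of raising on the first failure lets a caller report every problem at once, which tends to be more convenient when the dictionary is assembled from user input.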
Outputs
| Name | Type | Description |
|---|---|---|
| SpeculativeConfig | SpeculativeConfig | A validated configuration object created internally by EngineArgs.create_speculative_config(). Contains all speculative decoding parameters plus resolved draft model configuration. |
Usage Examples
EAGLE Configuration
speculative_config = {
    "method": "eagle",
    "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
}
EAGLE3 Configuration with Parallel Drafting
speculative_config = {
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 3,
    "disable_padded_drafter_batch": False,
    "parallel_drafting": True,
}
N-gram Configuration
speculative_config = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}
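To give a feel for what the two window parameters control, here is a toy prompt-lookup proposer. It is an illustration only and makes no claim to match vLLM's implementation. `propose_ngram_draft` is a hypothetical name, and preferring the most recent earlier match is an arbitrary choice made for this sketch.

```python
def propose_ngram_draft(tokens, lookup_min, lookup_max, num_draft):
    """Propose up to num_draft tokens by matching the tail of `tokens` earlier in the sequence."""
    # Try the longest window first, then fall back to shorter ones.
    for n in range(lookup_max, lookup_min - 1, -1):
        if n <= 0 or n >= len(tokens):
            continue
        tail = tokens[-n:]
        # Scan backwards over earlier positions, skipping the tail itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                # The tokens that followed the earlier occurrence become the draft.
                return tokens[start + n:start + n + num_draft]
    return []


history = [1, 2, 3, 4, 1, 2, 3]
draft = propose_ngram_draft(history, lookup_min=2, lookup_max=3, num_draft=2)
```

In this run the trailing window `[1, 2, 3]` also appears at the start of `history`, so the two tokens that followed it there are proposed as the draft. A larger `prompt_lookup_max` demands a longer match before drafting, while `prompt_lookup_min` sets how short a match is still accepted.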
MTP Configuration
# For models with native MTP support (e.g., DeepSeek-V3)
speculative_config = {
    "method": "mtp",
    "num_speculative_tokens": 2,
}
Draft Model Configuration
speculative_config = {
    "method": "draft_model",
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "num_speculative_tokens": 3,
    "enforce_eager": True,
    "max_model_len": 16384,
    "parallel_drafting": False,
}