Principle:Vllm project Vllm Speculative Decoding Configuration
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Speculative Decoding, Configuration |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Speculative decoding configuration is the process of assembling a parameter dictionary that fully describes the draft-then-verify strategy, including method, model path, token count, and method-specific settings.
Description
The speculative decoding configuration dictionary is the central data structure that connects the user's strategic choices (which method, how many tokens to speculate) with the engine's internal machinery. It is a plain Python dictionary that is passed to the LLM constructor via the speculative_config keyword argument. The engine parses this dictionary into a SpeculativeConfig dataclass that validates parameters, resolves model paths, and initializes the draft model infrastructure.
The configuration dictionary serves as the single source of truth for speculative decoding behavior. It must contain a method key and a num_speculative_tokens key at minimum. Depending on the method, additional keys such as model (for EAGLE, EAGLE3, and draft model methods) or prompt_lookup_max/prompt_lookup_min (for n-gram) are required.
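As a rough illustration of the key requirements described above, the dictionaries below sketch what a minimal configuration for each method might look like. The model names and paths are placeholders rather than recommendations, the numeric values are arbitrary, and build_llm is a hypothetical helper that assumes vLLM is installed.

```python
# Illustrative configurations only; paths and numeric values are placeholders.

# N-gram (prompt lookup): no draft model, but the lookup window is bounded.
ngram_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,
    "prompt_lookup_max": 4,
    "prompt_lookup_min": 2,
}

# EAGLE: requires the path to a head trained for the target model.
eagle_config = {
    "method": "eagle",
    "model": "path/to/eagle-head",  # placeholder path
    "num_speculative_tokens": 3,
}

# Draft model: requires the path to a smaller model from the same family.
draft_model_config = {
    "method": "draft_model",
    "model": "path/to/small-draft-model",  # placeholder path
    "num_speculative_tokens": 4,
}


def build_llm(target_model: str, speculative_config: dict):
    """Hypothetical helper: pass the dictionary to the LLM constructor."""
    # Imported lazily so the dictionaries above can be inspected without vLLM.
    from vllm import LLM

    return LLM(model=target_model, speculative_config=speculative_config)
```

Keeping the dictionary separate from the constructor call makes it easy to log, diff, or swap configurations when comparing methods on the same workload.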
Usage
Use this principle when constructing the speculative decoding configuration before passing it to the vLLM engine. Understanding the configuration schema ensures that:
- All required parameters for the chosen method are provided
- Optional tuning parameters (e.g., num_speculative_tokens) are set appropriately for the workload
- Method-specific constraints are satisfied (e.g., n-gram requires prompt_lookup_max)
Theoretical Basis
Configuration as Hyperparameter Space
The speculative decoding configuration defines a hyperparameter space that controls the tradeoff between speculation aggressiveness and overhead:
- num_speculative_tokens (K): The number of draft tokens to generate per speculation round. A higher K raises the potential speedup, but acceptance rates fall at later positions, because a draft token can be accepted only if every draft token before it was accepted. The optimal K depends on the acceptance rate profile and the relative cost of drafting versus verification.
  Expected speedup factor ≈ (1 + E[accepted]) / (cost_draft * K + cost_verify)
  Where:
  K = num_speculative_tokens
  E[accepted] = expected number of accepted draft tokens per round (between 0 and K)
  cost_draft = cost per draft token (method-dependent)
  cost_verify = cost of one target model forward pass (verifies K tokens)
  Both costs are expressed relative to one ordinary (non-speculative) target-model decoding step, so a value above 1 indicates a net speedup.
- prompt_lookup_max / prompt_lookup_min: For n-gram methods, these control the window size for matching. A larger window increases the chance of finding longer matches but costs more search time. The minimum window prevents very short (low-quality) matches.
- model: The draft model or EAGLE head path. For EAGLE methods, a head trained on the specific target model will have much higher acceptance rates than a mismatched head. For draft model methods, a model from the same family and training data produces better drafts.
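The tradeoff behind num_speculative_tokens can be explored numerically with the speedup formula above. The sketch below makes a simplifying assumption that is not part of the source: every draft token is accepted independently with the same probability alpha, and acceptance stops at the first rejection, so E[accepted] = alpha + alpha^2 + ... + alpha^K. Real acceptance profiles vary by position and workload, and the function names are illustrative only.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected accepted draft tokens per round.

    Assumption (not from vLLM): each draft token is accepted independently
    with probability alpha, and position i counts only if positions 1..i-1
    were all accepted, giving sum(alpha**i for i in 1..k).
    """
    return sum(alpha ** i for i in range(1, k + 1))


def estimated_speedup(alpha: float, k: int, cost_draft: float,
                      cost_verify: float = 1.0) -> float:
    """Speedup estimate: (1 + E[accepted]) / (cost_draft * K + cost_verify).

    Costs are relative to one ordinary target-model decoding step.
    """
    return (1.0 + expected_accepted(alpha, k)) / (cost_draft * k + cost_verify)


def best_k(alpha: float, cost_draft: float, max_k: int = 16,
           cost_verify: float = 1.0) -> int:
    """Return the K in 1..max_k that maximizes the estimated speedup."""
    return max(
        range(1, max_k + 1),
        key=lambda k: estimated_speedup(alpha, k, cost_draft, cost_verify),
    )


if __name__ == "__main__":
    # Sweep K for an assumed acceptance probability and draft cost.
    for k in range(1, 9):
        print(k, round(estimated_speedup(alpha=0.8, k=k, cost_draft=0.1), 3))
```

Under this model the estimate rises with K, peaks, and then declines as the extra drafting cost outweighs the shrinking chance that late positions are accepted. With a low acceptance probability the estimate can drop below 1, meaning speculation would be a net slowdown.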
Parameter Validation
The configuration undergoes validation when the SpeculativeConfig dataclass is instantiated:
- num_speculative_tokens must be a positive integer (enforced by Field(gt=0))
- If method is "ngram", then prompt_lookup_max must be provided
- If method is "eagle", "eagle3", or "draft_model", then model must be provided
- If method is "mtp", the target model must natively support multi-token prediction heads
- draft_tensor_parallel_size, if provided, must be 1 or equal to the target model's tensor parallel size
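The rules above can be mirrored in a small pre-flight check that runs before the dictionary reaches the engine. This is a plain-Python sketch of the listed constraints, not vLLM's actual validation code: check_speculative_config and its target_tp_size parameter are hypothetical names, and the "mtp" rule is omitted because it depends on the target model rather than on the dictionary.

```python
def check_speculative_config(config: dict, target_tp_size: int = 1) -> list[str]:
    """Return a list of problems found in a speculative config dictionary.

    Hypothetical helper that mirrors the documented rules; an empty list
    means no problem was detected by these checks.
    """
    problems = []
    method = config.get("method")
    k = config.get("num_speculative_tokens")

    if method is None:
        problems.append("missing 'method'")

    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(k, int) or isinstance(k, bool) or k <= 0:
        problems.append("'num_speculative_tokens' must be a positive integer")

    if method == "ngram" and "prompt_lookup_max" not in config:
        problems.append("'ngram' requires 'prompt_lookup_max'")

    if method in ("eagle", "eagle3", "draft_model") and not config.get("model"):
        problems.append(f"'{method}' requires 'model'")

    # The "mtp" rule depends on the target model and cannot be checked here.

    draft_tp = config.get("draft_tensor_parallel_size")
    if draft_tp is not None and draft_tp not in (1, target_tp_size):
        problems.append(
            "'draft_tensor_parallel_size' must be 1 or match the target's size"
        )

    return problems
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem in one pass, which is convenient when configurations are generated programmatically.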