Implementation:Vllm project Vllm LLM Generate Speculative
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Speculative Decoding, Text Generation |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool, provided by vLLM, for generating text completions with speculative decoding transparently enabled.
Description
The LLM.generate() method produces text completions for one or more input prompts. When the LLM instance was initialized with a speculative_config, the generation transparently uses the draft-then-verify loop to accelerate inference. The method signature, input format, and output format are identical to non-speculative generation. This means applications can enable or disable speculative decoding purely through engine configuration without any changes to the generation code path.
Internally, the method delegates to _run_completion(), which submits requests to the engine. The engine's scheduler orchestrates the draft and verify phases, returning RequestOutput objects that contain the generated token sequences and associated metadata.
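The draft-then-verify loop described above can be illustrated with a toy, pure-Python sketch for the greedy case. This is not vLLM's implementation: the "models" are hypothetical functions that map a token prefix to the next token, and the verify step calls the target once per position where a real engine would score all drafted positions in one batched forward pass. The sketch only shows why accepted tokens always agree with what the target model alone would have produced.

```python
from typing import Callable, List

# A "model" here is a hypothetical stand-in: prefix of token IDs -> next token ID.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline greedy decoding using only the target model."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 3
) -> List[int]:
    """Toy draft-then-verify loop for greedy decoding."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # Verify phase: keep drafted tokens while they match the target's choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: use the target's token
                break
        else:
            # Every drafted token was accepted, so the target adds one bonus token.
            accepted.append(target(tokens + accepted))

        accepted = accepted[: max_new - generated]  # respect the token budget
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens[len(prompt):]


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 4.
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
speculative = speculative_greedy_decode(toy_target, toy_draft, [1, 2, 3], 20, k=3)
```

In this sketch every emitted token is either a drafted token the target agreed with or the target's own choice, so `speculative` and `baseline` are the same sequence regardless of how often the draft is wrong; only the number of verification rounds changes.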
Usage
Use this method to generate text completions with a speculative decoding-enabled engine. The API is the same as standard (non-speculative) generation. Call it with prompts and sampling parameters after initializing the LLM with a speculative_config.
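Because speculative decoding is selected at engine construction time, one way to keep the generation code path unchanged is to vary only the constructor arguments. The helper below is a hypothetical sketch (`build_llm_kwargs` is not part of vLLM); it builds the keyword arguments for `LLM(...)` with or without a `speculative_config`, reusing the n-gram settings shown in the examples on this page.

```python
from typing import Any, Dict, Optional


def build_llm_kwargs(
    model: str, speculative_config: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: assemble LLM constructor kwargs.

    Speculative decoding is toggled only by the presence of speculative_config.
    """
    kwargs: Dict[str, Any] = {"model": model}
    if speculative_config is not None:
        kwargs["speculative_config"] = dict(speculative_config)
    return kwargs


NGRAM_CONFIG = {
    "method": "ngram",
    "num_speculative_tokens": 3,
    "prompt_lookup_max": 5,
    "prompt_lookup_min": 2,
}

baseline_kwargs = build_llm_kwargs("meta-llama/Llama-3.1-8B-Instruct")
spec_kwargs = build_llm_kwargs("meta-llama/Llama-3.1-8B-Instruct", NGRAM_CONFIG)

# With vLLM installed, the generation call is the same for either configuration:
#   from vllm import LLM, SamplingParams
#   llm = LLM(**spec_kwargs)  # or LLM(**baseline_kwargs)
#   outputs = llm.generate(["Hello"], SamplingParams(temperature=0, max_tokens=32))
```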
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py:L396-459
Signature
def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]:
    """Generates the completions for the input prompts."""
    ...
Import
from vllm import LLM, SamplingParams
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompts | PromptType or Sequence[PromptType] | Yes | One or more input prompts. Can be strings, token ID lists, or dictionaries with prompt_token_ids and optional multi_modal_data. |
| sampling_params | SamplingParams or Sequence[SamplingParams] or None | No | Sampling configuration controlling temperature, top-p, top-k, max_tokens, etc. When None, default sampling parameters are used. When a single value, it is applied to all prompts. When a list, it must match the length of prompts. |
| use_tqdm | bool or Callable | No | Controls progress bar display. Defaults to True. |
| lora_request | list[LoRARequest] or LoRARequest or None | No | LoRA adapter request(s) for generation. Defaults to None. |
| priority | list[int] or None | No | Request priorities for priority scheduling. Defaults to None. |
| tokenization_kwargs | dict[str, Any] or None | No | Overrides for the tokenizer's encode method. Defaults to None. |
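The three cases for sampling_params in the table (None, a single value, or a list matching the prompts) amount to a simple broadcasting rule. The sketch below illustrates that rule with a hypothetical helper, `expand_sampling_params`, and plain dictionaries standing in for SamplingParams objects; it is not vLLM's internal code, and the exact error vLLM raises on a length mismatch is not assumed here.

```python
from typing import Any, List, Optional, Sequence, Union


def expand_sampling_params(
    num_prompts: int,
    sampling_params: Optional[Union[Any, Sequence[Any]]],
    default: Any = None,
) -> List[Any]:
    """Hypothetical helper: produce one parameter object per prompt."""
    if sampling_params is None:
        # No parameters given: every prompt uses the default.
        return [default] * num_prompts
    if isinstance(sampling_params, (list, tuple)):
        # A sequence must line up one-to-one with the prompts.
        if len(sampling_params) != num_prompts:
            raise ValueError(
                f"got {len(sampling_params)} sampling params for {num_prompts} prompts"
            )
        return list(sampling_params)
    # A single value is shared by all prompts.
    return [sampling_params] * num_prompts


greedy = {"temperature": 0, "max_tokens": 256}    # stand-in for SamplingParams
creative = {"temperature": 0.8, "max_tokens": 64}  # stand-in for SamplingParams

shared = expand_sampling_params(3, greedy)
per_prompt = expand_sampling_params(2, [greedy, creative])
defaults = expand_sampling_params(2, None, default={"temperature": 1.0})
```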
Outputs
| Name | Type | Description |
|---|---|---|
| outputs | list[RequestOutput] | A list of RequestOutput objects in the same order as the input prompts. Each contains the generated token sequences, text, and metadata. Speculative decoding is designed to be lossless: greedy output should match non-speculative generation, and sampled output follows the same distribution, up to floating-point numerics. |
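Since the returned list preserves prompt order, results can be post-processed positionally. The following sketch assumes the commonly used attributes `prompt` on each RequestOutput and `text` and `token_ids` on each completion in its `outputs` list; `summarize` is a hypothetical helper, and the objects built with `SimpleNamespace` are stand-ins so the snippet runs without a GPU or a vLLM installation.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def summarize(outputs: Iterable[Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: flatten the first completion of each request."""
    rows = []
    for request_output in outputs:
        completion = request_output.outputs[0]  # first (often only) completion
        rows.append(
            {
                "prompt": request_output.prompt,
                "text": completion.text,
                "num_tokens": len(completion.token_ids),
            }
        )
    return rows


# Stand-in objects mimicking the assumed shape of llm.generate() results.
fake_outputs = [
    SimpleNamespace(
        prompt="What is machine learning?",
        outputs=[SimpleNamespace(text=" A field of AI.", token_ids=[11, 22, 33, 44])],
    ),
    SimpleNamespace(
        prompt="Write a haiku about spring.",
        outputs=[SimpleNamespace(text=" Blossoms fall.", token_ids=[55, 66, 77])],
    ),
]

summary = summarize(fake_outputs)
```

The same function should work on real results whether or not the engine was built with a speculative_config, since the output type does not change.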
Usage Examples
Basic Speculative Generation
from vllm import LLM, SamplingParams

# Engine initialized with speculative config (e.g., EAGLE)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    trust_remote_code=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

# generate() API is identical to non-speculative usage
sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(
    ["Explain the theory of relativity in simple terms."],
    sampling_params=sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)
Batch Generation with Token IDs
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# N-gram speculation needs no separate draft model
llm = LLM(
    model=model_name,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)

prompts = [
    "Summarize the following: The quick brown fox...",
    "What is machine learning?",
    "Write a haiku about spring.",
]

# Encode prompts as token IDs
llm_prompts = [
    {"prompt_token_ids": tokenizer.encode(p, add_special_tokens=False)}
    for p in prompts
]

sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(llm_prompts, sampling_params=sampling_params)

# Outputs are returned in the same order as the prompts
for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {output.outputs[0].text}")
    print("-" * 50)
Greedy vs. Sampling
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

prompt = ["What are the benefits of exercise?"]

# Greedy decoding: expected to match non-speculative output
# (up to floating-point numerics)
greedy_params = SamplingParams(temperature=0, max_tokens=128)
greedy_output = llm.generate(prompt, greedy_params)

# Sampling: drawn from the same distribution as non-speculative output
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
sampled_output = llm.generate(prompt, sampling_params)
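The "same distribution" claim for sampling can be checked on a small worked example. The sketch below uses the standard accept/reject rule from the speculative sampling literature: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(0, p - q), where p is the target distribution and q the draft distribution. Whether vLLM's sampler takes exactly this form is an assumption; the function name and the example probabilities are illustrative only.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept/reject step.

    p: target model probabilities; q: draft model probabilities (same vocabulary).
    """
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        return accept  # p and q are identical, so nothing is ever rejected

    return [a + reject_mass * r / residual_total for a, r in zip(accept, residual)]


target_probs = [0.5, 0.3, 0.2]  # illustrative target distribution
draft_probs = [0.2, 0.6, 0.2]   # illustrative, deliberately mismatched draft

emitted = speculative_output_distribution(target_probs, draft_probs)
max_error = max(abs(e - t) for e, t in zip(emitted, target_probs))
```

Here `emitted` coincides with `target_probs` up to rounding even though the draft distribution is quite different; a poor draft lowers the acceptance rate (and thus the speedup) but does not change what is sampled.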