Implementation:Vllm project Vllm LLM Generate Speculative

Knowledge Sources	vLLM vLLM Docs
Domains	LLM Inference, Speculative Decoding, Text Generation
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for generating text completions with speculative decoding enabled transparently.

Description

The LLM.generate() method produces text completions for one or more input prompts. When the LLM instance was initialized with a speculative_config, the generation transparently uses the draft-then-verify loop to accelerate inference. The method signature, input format, and output format are identical to non-speculative generation. This means applications can enable or disable speculative decoding purely through engine configuration without any changes to the generation code path.

Internally, the method delegates to _run_completion(), which submits requests to the engine. The engine's scheduler orchestrates the draft and verify phases, returning RequestOutput objects that contain the generated token sequences and associated metadata.
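
The draft-then-verify loop described above can be illustrated with a toy sketch. This is not vLLM's scheduler code: `target_next`, `draft_next`, and both decode functions are hypothetical stand-ins, and the "models" are simple deterministic functions. The sketch only shows why greedy speculative decoding yields the same tokens as plain greedy decoding: every emitted token is one the target model would have chosen for that context.

```python
from typing import List


def target_next(ctx: List[int]) -> int:
    # Hypothetical "target model": a deterministic function of the context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    # Hypothetical "draft model": agrees with the target except when the
    # context length is a multiple of 3, so some drafts get rejected.
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 3 == 0 else guess


def greedy_decode(prompt: List[int], max_tokens: int) -> List[int]:
    # Baseline: one target-model step per generated token.
    out = list(prompt)
    for _ in range(max_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_greedy_decode(prompt: List[int], max_tokens: int, k: int = 3) -> List[int]:
    out = list(prompt)
    generated = 0
    while generated < max_tokens:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        draft: List[int] = []
        ctx = list(out)
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)

        # Verify phase: the target checks each proposal in order. A real
        # engine would score all k positions in one batched forward pass.
        accepted: List[int] = []
        ctx = list(out)
        for token in draft:
            expected = target_next(ctx)
            if token == expected:
                accepted.append(token)
                ctx.append(token)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            accepted.append(target_next(ctx))  # all accepted: one bonus token

        # Emit accepted tokens, respecting the max_tokens budget.
        for token in accepted:
            if generated >= max_tokens:
                break
            out.append(token)
            generated += 1
    return out[len(prompt):]


baseline = greedy_decode([1, 2, 3], max_tokens=20)
speculative = speculative_greedy_decode([1, 2, 3], max_tokens=20, k=3)
```

In this toy setting `baseline` and `speculative` hold the same 20 tokens; the speculative loop simply needs fewer verify rounds when the draft is often right.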

Usage

Use this method to generate text completions with a speculative decoding-enabled engine. The API is the same as standard (non-speculative) generation. Call it with prompts and sampling parameters after initializing the LLM with a speculative_config.
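
As a minimal sketch of toggling speculative decoding purely through configuration, the helper below builds the keyword arguments for the `LLM` constructor. `build_engine_kwargs` is a hypothetical helper introduced here for illustration, not part of vLLM; the `speculative_config` keys are the ones used in the examples on this page.

```python
from typing import Any, Dict, Optional


def build_engine_kwargs(
    model: str,
    speculative_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble constructor kwargs for an engine.

    Passing speculative_config=None yields a plain (non-speculative) setup.
    """
    kwargs: Dict[str, Any] = {"model": model}
    if speculative_config is not None:
        # Copy so later edits by the caller do not leak into the kwargs.
        kwargs["speculative_config"] = dict(speculative_config)
    return kwargs


MODEL = "meta-llama/Llama-3.1-8B-Instruct"

baseline_kwargs = build_engine_kwargs(MODEL)
spec_kwargs = build_engine_kwargs(
    MODEL,
    {
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)
```

Either dictionary would then be passed as `LLM(**kwargs)`, and the subsequent `llm.generate(prompts, sampling_params)` call stays the same in both cases.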

Code Reference

Source Location

Repository: vllm
File: vllm/entrypoints/llm.py:L396-459

Signature

def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]:
    """Generates the completions for the input prompts."""
    ...

Import

from vllm import LLM, SamplingParams

I/O Contract

Inputs

Name	Type	Required	Description
prompts	`PromptType or Sequence[PromptType]`	Yes	One or more input prompts. Can be strings, token ID lists, or dictionaries with `prompt_token_ids` and optional `multi_modal_data`.
sampling_params	`SamplingParams or Sequence[SamplingParams] or None`	No	Sampling configuration controlling temperature, top-p, top-k, max_tokens, etc. When `None`, default sampling parameters are used. When a single value, it is applied to all prompts. When a list, it must match the length of prompts.
use_tqdm	`bool or Callable`	No	Controls progress bar display. Defaults to `True`.
lora_request	`list[LoRARequest] or LoRARequest or None`	No	LoRA adapter request(s) for generation. Defaults to `None`.
priority	`list[int] or None`	No	Request priorities for priority scheduling. Defaults to `None`.
tokenization_kwargs	`dict[str, Any] or None`	No	Overrides for the tokenizer's encode method. Defaults to `None`.
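
The `sampling_params` rule in the table (None, a single value, or one value per prompt) can be sketched as follows. `expand_per_prompt` is a hypothetical helper written for this page, not vLLM's internal logic, and plain dictionaries stand in for `SamplingParams` objects so the sketch runs without vLLM installed.

```python
from typing import Any, List


def expand_per_prompt(params: Any, num_prompts: int, default: Any) -> List[Any]:
    """Hypothetical helper: resolve params to exactly one entry per prompt."""
    if params is None:
        # No parameters given: every prompt uses the default.
        return [default] * num_prompts
    if isinstance(params, (list, tuple)):
        # A sequence must line up one-to-one with the prompts.
        if len(params) != num_prompts:
            raise ValueError(
                f"got {len(params)} sampling params for {num_prompts} prompts"
            )
        return list(params)
    # A single value is shared by all prompts.
    return [params] * num_prompts


default_params = {"temperature": 1.0}
greedy = {"temperature": 0.0, "max_tokens": 256}

shared = expand_per_prompt(greedy, num_prompts=3, default=default_params)
fallback = expand_per_prompt(None, num_prompts=2, default=default_params)
```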

Outputs

Name	Type	Description
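outputs	`list[RequestOutput]`	A list of `RequestOutput` objects in the same order as the input prompts. Each contains the generated token sequences, text, and metadata. Speculative decoding is designed to be lossless: with greedy decoding the tokens are expected to match non-speculative generation up to floating-point numerics, and with sampling the outputs follow the same distribution.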
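
The claim that sampled outputs keep the target model's distribution can be checked with a small worked example. The sketch below uses the standard speculative-sampling acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q). That vLLM's sampler has exactly this form is an assumption here, and `speculative_output_distribution` is a hypothetical name; the distributions are made-up numbers.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of one emitted token under draft-then-verify.

    p: target-model probabilities, q: draft-model probabilities.
    """
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)),
    # which simplifies to min(p(x), q(x)).
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)

    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        # Draft and target agree everywhere, so nothing is ever rejected.
        return accept_mass

    return [
        a + reject_prob * r / residual_total
        for a, r in zip(accept_mass, residual)
    ]


target = [0.5, 0.3, 0.2]  # made-up target distribution
draft = [0.2, 0.2, 0.6]   # made-up, deliberately poor draft distribution
result = speculative_output_distribution(target, draft)
```

Here `result` matches `target` up to rounding even though the draft distribution is far off. A weaker draft lowers the acceptance rate, and with it the speedup, but under this rule it does not change what is sampled.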

Usage Examples

Basic Speculative Generation

from vllm import LLM, SamplingParams

# Engine initialized with speculative config (e.g., EAGLE)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    trust_remote_code=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

# generate() API is identical to non-speculative usage
sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(
    ["Explain the theory of relativity in simple terms."],
    sampling_params=sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)

Batch Generation with Token IDs

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

llm = LLM(
    model=model_name,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)

prompts = [
    "Summarize the following: The quick brown fox...",
    "What is machine learning?",
    "Write a haiku about spring.",
]

# Encode prompts as token IDs
llm_prompts = [
    {"prompt_token_ids": tokenizer.encode(p, add_special_tokens=False)}
    for p in prompts
]

sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(llm_prompts, sampling_params=sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {output.outputs[0].text}")
    print("-" * 50)

Greedy vs. Sampling

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

prompt = ["What are the benefits of exercise?"]

# Greedy decoding: expected to match non-speculative output (up to floating-point numerics)
greedy_params = SamplingParams(temperature=0, max_tokens=128)
greedy_output = llm.generate(prompt, greedy_params)

# Sampling: same distribution as non-speculative output
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
sampled_output = llm.generate(prompt, sampling_params)

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Speculative_Generation
