Implementation:Vllm project Vllm LLM Generate Speculative

From Leeroopedia


Knowledge Sources
Domains LLM Inference, Speculative Decoding, Text Generation
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete vLLM API for generating text completions, with speculative decoding applied transparently when the engine is configured for it.

Description

The LLM.generate() method produces text completions for one or more input prompts. When the LLM instance was initialized with a speculative_config, the generation transparently uses the draft-then-verify loop to accelerate inference. The method signature, input format, and output format are identical to non-speculative generation. This means applications can enable or disable speculative decoding purely through engine configuration without any changes to the generation code path.

Internally, the method delegates to _run_completion(), which submits requests to the engine. The engine's scheduler orchestrates the draft and verify phases, returning RequestOutput objects that contain the generated token sequences and associated metadata.
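The draft-then-verify loop described above can be illustrated with a small, self-contained sketch. This is not vLLM's implementation: the "models" here are hypothetical plain Python functions that map a token prefix to a greedy next token, and the verification step calls the target once per position, whereas a real engine would score all drafted positions in a single forward pass. The names `greedy_decode`, `speculative_greedy_decode`, `toy_target`, and `toy_draft` are assumptions made for illustration only.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding using only the target model."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_tokens: int,
    k: int = 3,
) -> List[int]:
    """Toy greedy draft-then-verify loop (illustrative, not vLLM's code)."""
    tokens = list(prompt)
    limit = len(prompt) + max_tokens
    while len(tokens) < limit:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Verify phase: keep the target's token at each position and stop
        # at the first disagreement with the draft (or at the length limit).
        for proposed in proposal:
            verified = target(tokens)
            tokens.append(verified)
            if verified != proposed or len(tokens) >= limit:
                break
        else:
            # Every drafted token was accepted: the target adds one bonus token.
            tokens.append(target(tokens))
    return tokens[len(prompt):]


def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target only when the last token is even.
    return toy_target(tokens) if tokens[-1] % 2 == 0 else 0


prompt = [2]
baseline = greedy_decode(toy_target, prompt, max_tokens=10)
speculative = speculative_greedy_decode(toy_target, toy_draft, prompt, max_tokens=10, k=3)
```

Because every emitted token comes from the target function, the speculative loop in this sketch reproduces the baseline sequence regardless of how often the draft is wrong; the draft only affects how many target steps can be batched together.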

Usage

Use this method to generate text completions on an engine that has speculative decoding enabled. Initialize the LLM with a speculative_config, then call generate() with prompts and sampling parameters exactly as for standard (non-speculative) generation.
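As a rough sketch of how speculative decoding can be switched on or off through configuration alone, the helper below assembles constructor arguments with or without a speculative_config. The helper name `build_llm_kwargs` is hypothetical and not part of vLLM; the configuration values are taken from the n-gram example on this page. The vLLM calls are shown only as comments so that the snippet runs without a GPU.

```python
from typing import Any, Dict, Optional


def build_llm_kwargs(
    model: str,
    speculative_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for vllm.LLM(...)."""
    kwargs: Dict[str, Any] = {"model": model}
    if speculative_config is not None:
        # Copy so later edits to the caller's dict do not leak into the kwargs.
        kwargs["speculative_config"] = dict(speculative_config)
    return kwargs


MODEL = "meta-llama/Llama-3.1-8B-Instruct"

baseline_kwargs = build_llm_kwargs(MODEL)
speculative_kwargs = build_llm_kwargs(
    MODEL,
    {
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)

# With vLLM installed and suitable hardware, either dictionary would be passed
# to the constructor, and the generation call itself would not change:
#   from vllm import LLM, SamplingParams
#   llm = LLM(**speculative_kwargs)   # or LLM(**baseline_kwargs)
#   outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=256))
```

Keeping the toggle in the constructor arguments means application code that calls generate() does not need a separate branch for the speculative case.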

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py:L396-459

Signature

def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]:
    """Generates the completions for the input prompts."""
    ...

Import

from vllm import LLM, SamplingParams

I/O Contract

Inputs

Name | Type | Required | Description
prompts | PromptType or Sequence[PromptType] | Yes | One or more input prompts. Can be strings, token ID lists, or dictionaries with prompt_token_ids and optional multi_modal_data.
sampling_params | SamplingParams or Sequence[SamplingParams] or None | No | Sampling configuration controlling temperature, top-p, top-k, max_tokens, etc. When None, default sampling parameters are used. When a single value, it is applied to all prompts. When a list, it must match the length of prompts.
use_tqdm | bool or Callable | No | Controls progress bar display. Defaults to True.
lora_request | list[LoRARequest] or LoRARequest or None | No | LoRA adapter request(s) for generation. Defaults to None.
priority | list[int] or None | No | Request priorities for priority scheduling. Defaults to None.
tokenization_kwargs | dict[str, Any] or None | No | Overrides for the tokenizer's encode method. Defaults to None.

Outputs

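Name | Type | Description
outputs | list[RequestOutput] | A list of RequestOutput objects in the same order as the input prompts. Each contains the generated token sequences, text, and metadata. Speculative decoding is designed to be lossless: with greedy decoding the tokens are expected to match non-speculative generation, and with sampling they follow the same distribution, in both cases up to floating-point and batching effects.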
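To make the output structure more concrete, the sketch below walks a list of result objects and collects a few commonly used fields. The helper `summarize_outputs` is hypothetical. It relies only on the `prompt` and `outputs` attributes of each result and on the `text`, `token_ids`, and `finish_reason` attributes of each completion, and it is exercised here with a stand-in object rather than a real engine, so treat it as an illustration of the layout and not as a reference for the full API.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def summarize_outputs(outputs: Iterable[Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: collect key fields from RequestOutput-like objects."""
    rows: List[Dict[str, Any]] = []
    for request_output in outputs:
        # Take the first completion for this prompt.
        completion = request_output.outputs[0]
        rows.append(
            {
                # May be None when the prompt was supplied as token IDs.
                "prompt": request_output.prompt,
                "text": completion.text,
                "num_generated_tokens": len(completion.token_ids),
                "finish_reason": completion.finish_reason,
            }
        )
    return rows


# Stand-in object that mimics the attribute layout; a real run would pass
# the list returned by llm.generate(...) instead.
fake_output = SimpleNamespace(
    prompt="Hi",
    outputs=[SimpleNamespace(text=" there", token_ids=[1, 2], finish_reason="stop")],
)
summary = summarize_outputs([fake_output])
```

The same helper applies whether or not the engine was built with a speculative_config, since the output format does not change.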

Usage Examples

Basic Speculative Generation

from vllm import LLM, SamplingParams

# Engine initialized with speculative config (e.g., EAGLE)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    trust_remote_code=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

# generate() API is identical to non-speculative usage
sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(
    ["Explain the theory of relativity in simple terms."],
    sampling_params=sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)

Batch Generation with Token IDs

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

llm = LLM(
    model=model_name,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 2,
    },
)

prompts = [
    "Summarize the following: The quick brown fox...",
    "What is machine learning?",
    "Write a haiku about spring.",
]

# Encode prompts as token IDs
llm_prompts = [
    {"prompt_token_ids": tokenizer.encode(p, add_special_tokens=False)}
    for p in prompts
]

sampling_params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(llm_prompts, sampling_params=sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {output.outputs[0].text}")
    print("-" * 50)

Greedy vs. Sampling

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

prompt = ["What are the benefits of exercise?"]

# Greedy decoding: expected to match non-speculative output token for token
# (small floating-point or batching differences can still cause divergence)
greedy_params = SamplingParams(temperature=0, max_tokens=128)
greedy_output = llm.generate(prompt, greedy_params)

# Sampling: same distribution as non-speculative output
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
sampled_output = llm.generate(prompt, sampling_params)
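The "same distribution" comment in the sampling case can be illustrated with the accept/reject rule commonly used in the speculative sampling literature: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection a token is resampled from the normalized residual max(0, p - q). The sketch below computes the resulting output distribution exactly for a three-token vocabulary. It assumes p and q share the same vocabulary, the function names are hypothetical, and vLLM's internal sampler may differ in its details.

```python
from typing import Dict

Dist = Dict[str, float]


def residual_distribution(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = {tok: max(0.0, p[tok] - q[tok]) for tok in p}
    total = sum(raw.values())
    # If p == q the residual is empty, but then rejection never happens.
    return {tok: v / total for tok, v in raw.items()} if total > 0 else dict(p)


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the token emitted by one accept/reject step."""
    # Probability that the draft proposes tok AND the target accepts it.
    accepted = {
        tok: (q[tok] * min(1.0, p[tok] / q[tok])) if q[tok] > 0 else 0.0
        for tok in p
    }
    reject_prob = 1.0 - sum(accepted.values())
    residual = residual_distribution(p, q)
    return {tok: accepted[tok] + reject_prob * residual[tok] for tok in p}


target_dist = {"a": 0.5, "b": 0.3, "c": 0.2}  # p: what the target would sample
draft_dist = {"a": 0.2, "b": 0.6, "c": 0.2}   # q: what the draft proposes
result = output_distribution(target_dist, draft_dist)
```

In this example the draft over-proposes "b" and under-proposes "a"; the rejection step removes the excess mass on "b" and the residual resampling moves it to "a", so `result` matches `target_dist` up to floating-point rounding.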

Related Pages

Implements Principle
