Implementation:Vllm project Vllm LLM Generate

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, High-Performance Computing
Last Updated 2026-02-08 13:00 GMT

Overview

Concrete vLLM tool for batch text generation from one or more prompts.

Description

LLM.generate() is the primary method for offline text generation in vLLM. It accepts one or more prompts along with sampling parameters, submits them all to the internal LLM engine, and returns the generated completions. The method automatically handles batching, memory management, and scheduling through the engine's continuous batching infrastructure.

The method validates that the model is a generative model (runner_type == "generate"), resolves default sampling parameters if none are provided, and delegates to the internal _run_completion() method which manages the request lifecycle. A tqdm progress bar is shown by default to indicate generation progress.

Prompts can be provided as:

  • A single string
  • A list of strings
  • A list of TokensPrompt objects (pre-tokenized inputs)
  • A list of TextPrompt or other PromptType variants
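The variants above can be sketched as plain Python values. TextPrompt and TokensPrompt are TypedDicts in vLLM, so dicts with the matching keys are valid at runtime (a minimal sketch; the key names "prompt" and "prompt_token_ids" are assumed to match the current vLLM definitions, and the token IDs shown are illustrative):

```python
# The prompt forms accepted by LLM.generate(), as plain values.
single = "The future of AI is"                     # a single string
batch = ["First prompt", "Second prompt"]          # a list of strings
text_prompt = {"prompt": "The future of AI is"}    # TextPrompt variant
tokens_prompt = {"prompt_token_ids": [791, 3938]}  # pre-tokenized TokensPrompt

# Any of these can be passed as the `prompts` argument:
#   llm.generate(single, params)
#   llm.generate(batch, params)
#   llm.generate([text_prompt, tokens_prompt], params)
```

Mixing pre-tokenized and plain-text prompts in one batch is possible because every variant resolves to the same internal PromptType.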

Usage

Call LLM.generate() after initializing an LLM instance and configuring SamplingParams. For best throughput, pass all prompts in a single call rather than looping with one prompt at a time.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py
  • Lines: 396-459

Signature

def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]

Import

from vllm import LLM

I/O Contract

Inputs

  • prompts (PromptType or Sequence[PromptType], required): One or more prompts. Can be strings, TokensPrompt, TextPrompt, or other PromptType variants.
  • sampling_params (SamplingParams, Sequence[SamplingParams], or None; default: None): Sampling configuration. A single instance applies to all prompts; a list pairs one-to-one with prompts. None uses the model defaults.
  • use_tqdm (bool or Callable; default: True): If True, shows a tqdm progress bar. If a callable, it is used to create a custom progress bar. If False, no progress bar.
  • lora_request (LoRARequest, list[LoRARequest], or None; default: None): LoRA adapter(s) to apply during generation.
  • priority (list[int] or None; default: None): Per-prompt priority values for priority scheduling. Must match the length of prompts.
  • tokenization_kwargs (dict[str, Any] or None; default: None): Additional keyword arguments passed to the tokenizer's encode method.

Outputs

  • results (list[RequestOutput]): A list of RequestOutput objects in the same order as the input prompts. Each RequestOutput contains one or more CompletionOutput objects, depending on the n parameter in SamplingParams.
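To show how the returned structure nests, here is a minimal sketch using hypothetical stand-in classes that mirror the documented shape of RequestOutput and CompletionOutput (they are not the real vLLM classes); the traversal pattern is the same with real results:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins mirroring the documented result shape.
@dataclass
class CompletionOutput:
    index: int   # which of the n samples this completion is
    text: str    # the generated text

@dataclass
class RequestOutput:
    prompt: str
    outputs: list = field(default_factory=list)  # n completions per prompt

# With SamplingParams(n=2), each RequestOutput carries two completions:
results = [
    RequestOutput("Name a color:", [CompletionOutput(0, "Blue."),
                                    CompletionOutput(1, "Crimson.")]),
]

# Flatten all generated texts, preserving prompt order then sample order:
texts = [c.text for r in results for c in r.outputs]
```

Because results are returned in input order, `results[i]` always corresponds to the i-th prompt, so no request-ID bookkeeping is needed for offline batches.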

Usage Examples

Single Prompt Generation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

output = llm.generate("The future of AI is", params)
print(output[0].outputs[0].text)

Batch Generation with Multiple Prompts

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain the theory of relativity in simple terms:",
    "Write a haiku about programming:",
    "What are the three laws of thermodynamics?",
    "Describe the process of photosynthesis:",
]

outputs = llm.generate(prompts, params)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {output.outputs[0].text}\n")

Per-Prompt Sampling Parameters

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Summarize quantum computing:", "Write a creative story about:"]
params_list = [
    SamplingParams(temperature=0.1, max_tokens=100),  # Factual: low temp
    SamplingParams(temperature=0.9, max_tokens=300),  # Creative: high temp
]

outputs = llm.generate(prompts, params_list)
for output in outputs:
    print(output.outputs[0].text)

Generation without Progress Bar

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64)

# Suppress the tqdm progress bar
outputs = llm.generate(
    ["Hello, world!"],
    params,
    use_tqdm=False,
)

Related Pages

Implements Principle

Requires Environment
