Implementation:Vllm project Vllm LLM Generate

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, High-Performance Computing
Last Updated 2026-02-08 13:00 GMT

Overview

Concrete vLLM tool for batch text generation from one or more prompts.

Description

LLM.generate() is the primary method for offline text generation in vLLM. It accepts one or more prompts along with sampling parameters, submits them all to the internal LLM engine, and returns the generated completions. The method automatically handles batching, memory management, and scheduling through the engine's continuous batching infrastructure.

The method validates that the model is a generative model (runner_type == "generate"), resolves default sampling parameters if none are provided, and delegates to the internal _run_completion() method which manages the request lifecycle. A tqdm progress bar is shown by default to indicate generation progress.

Prompts can be provided as:

  • A single string
  • A list of strings
  • A list of TokensPrompt objects (pre-tokenized inputs)
  • A list of TextPrompt or other PromptType variants
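The variants above can be sketched as plain Python values. TextPrompt and TokensPrompt are TypedDicts in vLLM, so dicts with the matching keys are valid at runtime (a minimal sketch; the key names "prompt" and "prompt_token_ids" are assumed to match the current vLLM definitions, and the token IDs shown are illustrative):

```python
# The prompt forms accepted by LLM.generate(), as plain values.
single = "The future of AI is"                     # a single string
batch = ["First prompt", "Second prompt"]          # a list of strings
text_prompt = {"prompt": "The future of AI is"}    # TextPrompt variant
tokens_prompt = {"prompt_token_ids": [791, 3938]}  # pre-tokenized TokensPrompt

# Any of these can be passed as the `prompts` argument:
#   llm.generate(single, params)
#   llm.generate(batch, params)
#   llm.generate([text_prompt, tokens_prompt], params)
```

Mixing pre-tokenized and plain-text prompts in one batch is possible because every variant resolves to the same internal PromptType.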

Usage

Call LLM.generate() after initializing an LLM instance and configuring SamplingParams. For best throughput, pass all prompts in a single call rather than looping with one prompt at a time.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py
  • Lines: 396-459

Signature

def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]

Import

from vllm import LLM

I/O Contract

Inputs

  • prompts (PromptType or Sequence[PromptType], required): One or more prompts. Can be strings, TokensPrompt, TextPrompt, or other PromptType variants.
  • sampling_params (SamplingParams, Sequence[SamplingParams], or None; default: None): Sampling configuration. A single instance applies to all prompts; a list pairs one-to-one with prompts. None uses the model defaults.
  • use_tqdm (bool or Callable; default: True): If True, shows a tqdm progress bar. If a callable, it is used to create a custom progress bar. If False, no progress bar.
  • lora_request (LoRARequest, list[LoRARequest], or None; default: None): LoRA adapter(s) to apply during generation.
  • priority (list[int] or None; default: None): Per-prompt priority values for priority scheduling. Must match the length of prompts.
  • tokenization_kwargs (dict[str, Any] or None; default: None): Additional keyword arguments passed to the tokenizer's encode method.

Outputs

  • results (list[RequestOutput]): A list of RequestOutput objects in the same order as the input prompts. Each RequestOutput contains one or more CompletionOutput objects, depending on the n parameter in SamplingParams.
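To show how the returned structure nests, here is a minimal sketch using hypothetical stand-in classes that mirror the documented shape of RequestOutput and CompletionOutput (they are not the real vLLM classes); the traversal pattern is the same with real results:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins mirroring the documented result shape.
@dataclass
class CompletionOutput:
    index: int   # which of the n samples this completion is
    text: str    # the generated text

@dataclass
class RequestOutput:
    prompt: str
    outputs: list = field(default_factory=list)  # n completions per prompt

# With SamplingParams(n=2), each RequestOutput carries two completions:
results = [
    RequestOutput("Name a color:", [CompletionOutput(0, "Blue."),
                                    CompletionOutput(1, "Crimson.")]),
]

# Flatten all generated texts, preserving prompt order then sample order:
texts = [c.text for r in results for c in r.outputs]
```

Because results are returned in input order, `results[i]` always corresponds to the i-th prompt, so no request-ID bookkeeping is needed for offline batches.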

Usage Examples

Single Prompt Generation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

output = llm.generate("The future of AI is", params)
print(output[0].outputs[0].text)

Batch Generation with Multiple Prompts

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain the theory of relativity in simple terms:",
    "Write a haiku about programming:",
    "What are the three laws of thermodynamics?",
    "Describe the process of photosynthesis:",
]

outputs = llm.generate(prompts, params)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {output.outputs[0].text}\n")

Per-Prompt Sampling Parameters

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Summarize quantum computing:", "Write a creative story about:"]
params_list = [
    SamplingParams(temperature=0.1, max_tokens=100),  # Factual: low temp
    SamplingParams(temperature=0.9, max_tokens=300),  # Creative: high temp
]

outputs = llm.generate(prompts, params_list)
for output in outputs:
    print(output.outputs[0].text)

Generation without Progress Bar

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64)

# Suppress the tqdm progress bar
outputs = llm.generate(
    ["Hello, world!"],
    params,
    use_tqdm=False,
)

Related Pages

Implements Principle

Requires Environment
