# Implementation: vLLM `LLM.generate()`
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, High-Performance Computing |
| Last Updated | 2026-02-08 13:00 GMT |
## Overview

Concrete tool, provided by vLLM, for batch text generation from one or more prompts.
## Description

`LLM.generate()` is the primary method for offline text generation in vLLM. It accepts one or more prompts along with sampling parameters, submits them all to the internal LLM engine, and returns the generated completions. The method automatically handles batching, memory management, and scheduling through the engine's continuous batching infrastructure.

The method validates that the model is a generative model (`runner_type == "generate"`), resolves default sampling parameters if none are provided, and delegates to the internal `_run_completion()` method, which manages the request lifecycle. A tqdm progress bar is shown by default to indicate generation progress.
Prompts can be provided as:

- A single string
- A list of strings
- A list of `TokensPrompt` objects (pre-tokenized inputs)
- A list of `TextPrompt` or other `PromptType` variants
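Since `TokensPrompt` and `TextPrompt` are TypedDicts, plain dicts with the matching keys conform to them. A minimal sketch of a mixed batch (the token ids below are placeholders, not real tokenizer output):

```python
# Sketch: mixing prompt variants in one batch. TokensPrompt and
# TextPrompt are TypedDicts, so plain dicts with the right keys conform.
# The token ids below are placeholders, not real tokenizer output.
pre_tokenized = {"prompt_token_ids": [1, 2, 3]}  # TokensPrompt shape
text_prompt = {"prompt": "The future of AI is"}  # TextPrompt shape

prompts = ["A plain string prompt", text_prompt, pre_tokenized]
# llm.generate(prompts, params) accepts such a mixed list.
```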
## Usage

Call `LLM.generate()` after initializing an `LLM` instance and configuring `SamplingParams`. For best throughput, pass all prompts in a single call rather than looping over them one at a time; the engine's continuous batching can then schedule them together.
## Code Reference

### Source Location

- Repository: vllm
- File: `vllm/entrypoints/llm.py`
- Lines: 396-459
### Signature

```python
def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]
```
### Import

```python
from vllm import LLM
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompts | PromptType or Sequence[PromptType] | Yes | One or more prompts. Can be strings, TokensPrompt, TextPrompt, or other PromptType variants |
| sampling_params | SamplingParams, Sequence[SamplingParams], or None | No (default: None) | Sampling configuration. A single instance applies to all prompts; a list pairs one-to-one with prompts. None uses model defaults |
| use_tqdm | bool or Callable | No (default: True) | If True, shows a tqdm progress bar. If a callable, used to create a custom progress bar. If False, no progress bar |
| lora_request | LoRARequest, list[LoRARequest], or None | No (default: None) | LoRA adapter(s) to apply during generation |
| priority | list[int] or None | No (default: None) | Per-prompt priority values for priority scheduling. Must match the length of prompts |
| tokenization_kwargs | dict[str, Any] or None | No (default: None) | Additional keyword arguments passed to the tokenizer's encode method |
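The broadcast rule for `sampling_params` can be illustrated with a stand-alone sketch. This mirrors the behavior described in the table above but is not vLLM's internal code; `pair_sampling_params` is a hypothetical helper:

```python
def pair_sampling_params(prompts, sampling_params):
    """Sketch of the broadcast rule: a single params object applies to
    every prompt; a sequence must match the prompt count one-to-one.
    (Illustrative only -- not vLLM's actual implementation.)"""
    if not isinstance(sampling_params, (list, tuple)):
        return [sampling_params] * len(prompts)
    if len(sampling_params) != len(prompts):
        raise ValueError("sampling_params length must match prompts")
    return list(sampling_params)

# A single config is broadcast across all prompts:
paired = pair_sampling_params(["p1", "p2", "p3"], "params")
# A list pairs one-to-one and is length-checked:
paired_list = pair_sampling_params(["p1", "p2"], ["a", "b"])
```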
### Outputs

| Name | Type | Description |
|---|---|---|
| results | list[RequestOutput] | A list of RequestOutput objects in the same order as the input prompts. Each RequestOutput contains one or more CompletionOutput objects (depending on the n parameter in SamplingParams) |
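The nesting of the return value can be sketched with simplified stand-ins (the real `RequestOutput` and `CompletionOutput` classes carry additional fields such as token ids, logprobs, and finish reasons):

```python
from dataclasses import dataclass, field

# Simplified stand-ins mirroring the output shape described above;
# vLLM's real classes have more fields (token_ids, logprobs, ...).
@dataclass
class CompletionOutput:
    index: int
    text: str

@dataclass
class RequestOutput:
    prompt: str
    outputs: list[CompletionOutput] = field(default_factory=list)

# With n=2 in SamplingParams, each RequestOutput holds two completions:
result = RequestOutput(
    prompt="Write a haiku:",
    outputs=[CompletionOutput(0, "first sample"),
             CompletionOutput(1, "second sample")],
)
best_text = result.outputs[0].text  # the usual access pattern
```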
## Usage Examples

### Single Prompt Generation

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
output = llm.generate("The future of AI is", params)
print(output[0].outputs[0].text)
```
### Batch Generation with Multiple Prompts

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
    "Explain the theory of relativity in simple terms:",
    "Write a haiku about programming:",
    "What are the three laws of thermodynamics?",
    "Describe the process of photosynthesis:",
]
outputs = llm.generate(prompts, params)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {output.outputs[0].text}\n")
```
### Per-Prompt Sampling Parameters

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
prompts = ["Summarize quantum computing:", "Write a creative story about:"]
params_list = [
    SamplingParams(temperature=0.1, max_tokens=100),  # Factual: low temp
    SamplingParams(temperature=0.9, max_tokens=300),  # Creative: high temp
]
outputs = llm.generate(prompts, params_list)
for output in outputs:
    print(output.outputs[0].text)
```
### Generation without Progress Bar

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64)
# Suppress the tqdm progress bar
outputs = llm.generate(
    ["Hello, world!"],
    params,
    use_tqdm=False,
)
```