Implementation:Vllm project Vllm LLM Generate Constrained
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Structured Output, Constrained Decoding |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
vLLM's LLM.generate() method, used here to run constrained text generation in which structural constraints are enforced through logit masking.
Description
The LLM.generate() method is vLLM's primary offline generation entrypoint. When the sampling_params argument includes a structured_outputs field, the engine automatically:
- Detects the constraint type (JSON, regex, grammar, choice).
- Selects the appropriate guided decoding backend (xgrammar, outlines, or guidance) based on the engine configuration and constraint type.
- Compiles the constraint into a logits processor that produces token-level masks.
- At each decoding step, applies the mask to the model's logits before sampling.
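The masking step in the last bullet can be illustrated with a toy sketch. This is not vLLM's implementation: the function name masked_sample, the five-token vocabulary, and the allowed-token list are all assumptions made for illustration. It only shows the general idea that disallowed tokens receive a logit of negative infinity, so they can never be sampled.

```python
import math
import random


def masked_sample(logits, allowed_ids, rng):
    """Sample a token id after masking out tokens the constraint forbids.

    Hypothetical helper for illustration; not part of vLLM.
    """
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("the constraint must allow at least one token")
    # Disallowed tokens get -inf, so their softmax weight is exactly 0.
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)
    # Unnormalized softmax weights; math.exp(-inf) evaluates to 0.0.
    weights = [math.exp(x - peak) for x in masked]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Toy vocabulary of 5 tokens; assume the constraint permits only tokens 1 and 3.
logits = [2.0, 0.5, 3.0, 1.0, -1.0]
rng = random.Random(0)
samples = [masked_sample(logits, [1, 3], rng) for _ in range(100)]
```

Even though token 2 has the highest raw logit, it never appears in samples, because the mask removes it before sampling.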
The method accepts a single prompt or a sequence of prompts, and returns a list of RequestOutput objects in the same order. Each RequestOutput contains one or more CompletionOutput objects (depending on n in SamplingParams), with the generated text accessible via .outputs[0].text.
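As a small sketch of how the returned structure can be walked when n is greater than 1, the helper below gathers every completion's text per prompt. The name collect_texts is hypothetical, and the SimpleNamespace objects are stand-ins that mimic only the .outputs and .text attributes described above, so the snippet runs without a model.

```python
from types import SimpleNamespace


def collect_texts(request_outputs):
    """Return one list of generated strings per prompt, in prompt order."""
    return [[completion.text for completion in ro.outputs] for ro in request_outputs]


# Stand-ins for RequestOutput / CompletionOutput (illustrative only).
fake_results = [
    SimpleNamespace(outputs=[SimpleNamespace(text="Positive"),
                             SimpleNamespace(text="Negative")]),
    SimpleNamespace(outputs=[SimpleNamespace(text="Positive")]),
]
texts = collect_texts(fake_results)
```

With real results from llm.generate(), the same function can be applied directly to the returned list.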
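The output is guaranteed to conform to the specified constraint, provided generation runs to completion. For JSON constraints, the output is valid JSON matching the schema. For regex constraints, the output matches the pattern. For choice constraints, the output is exactly one of the listed strings. If generation stops early, for example because max_tokens is reached, the text may be a valid prefix of a conforming output rather than a complete one.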
Usage
Use this method after initializing the LLM engine and constructing SamplingParams with structured_outputs set. Pass one or more prompts and receive constrained outputs.
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py (lines 396-459)
Signature
def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]:
Import
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompts | PromptType \| Sequence[PromptType] | Yes | A single prompt or a sequence of prompts (strings, token ID lists, or dict-based prompts) to generate completions for |
| sampling_params | SamplingParams \| Sequence[SamplingParams] \| None | No (default: None, meaning engine defaults) | Sampling parameters including structured_outputs for constrained generation; a single instance applies to all prompts, or a list pairs one-to-one with prompts |
| use_tqdm | bool \| Callable[..., tqdm] | No (default: True) | Whether to display a progress bar during generation |
| lora_request | list[LoRARequest] \| LoRARequest \| None | No (default: None) | LoRA adapter request(s) for generation |
| priority | list[int] \| None | No (default: None) | Priority values for each prompt (only with priority scheduling) |
| tokenization_kwargs | dict[str, Any] \| None | No (default: None) | Overrides for tokenizer.encode() |
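The one-to-one pairing rule for sampling_params can be sketched as follows. This is a simplified illustration, not vLLM's internal logic: pair_params is a hypothetical helper, prompts are limited to strings, plain strings stand in for SamplingParams objects, and raising ValueError on a length mismatch is an assumption of this sketch.

```python
def pair_params(prompts, sampling_params=None, default="engine-defaults"):
    """Pair each prompt with its sampling parameters (illustrative only)."""
    if isinstance(prompts, str):
        prompts = [prompts]  # a single prompt becomes a one-element batch
    if isinstance(sampling_params, (list, tuple)):
        # A sequence must line up one-to-one with the prompts.
        if len(sampling_params) != len(prompts):
            raise ValueError("prompts and sampling_params differ in length")
        params = list(sampling_params)
    else:
        # A single instance (or None) is shared by every prompt.
        shared = default if sampling_params is None else sampling_params
        params = [shared] * len(prompts)
    return list(zip(prompts, params))


shared_pairs = pair_params(["a", "b"], "json-params")
paired = pair_params(["a", "b"], ["choice-params", "regex-params"])
```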
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[RequestOutput] | List of RequestOutput objects, one per input prompt, in the same order. Each contains .outputs[i].text with the generated text guaranteed to match the constraint. |
Usage Examples
JSON-Constrained Generation
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

# Pydantic model whose JSON schema defines the allowed output structure.
class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: str

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

# Constrain decoding to the JSON schema derived from the model.
structured = StructuredOutputsParams(json=CarDescription.model_json_schema())
sampling_params = SamplingParams(
    structured_outputs=structured,
    max_tokens=50,
)

outputs = llm.generate(
    "Generate a JSON describing the most iconic car from the 90s",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
# e.g., '{"brand": "Toyota", "model": "Supra", "car_type": "coupe"}'
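Because the example above caps generation at max_tokens=50, a long JSON object could be cut off before its closing brace. A minimal sketch of a guard is shown below. It assumes the completion object exposes a finish_reason attribute whose value is "length" when the token limit was hit; parse_if_complete is a hypothetical helper, and the SimpleNamespace objects are stand-ins so the snippet runs without a model.

```python
import json
from types import SimpleNamespace


def parse_if_complete(completion):
    """Return the parsed JSON object, or None if generation hit the token limit."""
    if completion.finish_reason == "length":
        # Truncated output may be an incomplete JSON prefix; do not parse it.
        return None
    return json.loads(completion.text)


# Stand-ins for CompletionOutput objects (illustrative values only).
finished = SimpleNamespace(
    text='{"brand": "Toyota", "model": "Supra", "car_type": "coupe"}',
    finish_reason="stop",
)
truncated = SimpleNamespace(text='{"brand": "Toyota", "mod', finish_reason="length")

car = parse_if_complete(finished)
missing = parse_if_complete(truncated)
```

With real results, the helper would be called as parse_if_complete(outputs[0].outputs[0]).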
Choice-Constrained Classification
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

# Restrict the output to exactly one of the listed labels.
structured = StructuredOutputsParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(structured_outputs=structured)

outputs = llm.generate(
    "Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
# e.g., "Positive"
Batch Generation with Different Constraints
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

prompts = [
    "Classify: I love this!",
    "Generate an email for Alan Turing at Enigma",
]

# One SamplingParams per prompt, paired by position.
params = [
    SamplingParams(
        structured_outputs=StructuredOutputsParams(
            choice=["Positive", "Negative"]
        ),
    ),
    SamplingParams(
        structured_outputs=StructuredOutputsParams(
            regex=r"\w+@\w+\.com"
        ),
        max_tokens=50,
    ),
]

outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)
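As a sanity check on batch results like these, each text can be tested against the constraint that was paired with its prompt. The sketch below uses only the standard library; matches_choice and matches_regex are hypothetical helpers, and sample_texts holds made-up strings standing in for model output.

```python
import re


def matches_choice(text, choices):
    """True if the text is exactly one of the allowed strings."""
    return text in choices


def matches_regex(text, pattern):
    """True if the whole text matches the pattern, not just a substring."""
    return re.fullmatch(pattern, text) is not None


# One check per prompt, in the same order as the constraints in the batch.
checks = [
    lambda t: matches_choice(t, ["Positive", "Negative"]),
    lambda t: matches_regex(t, r"\w+@\w+\.com"),
]

# Made-up outputs for illustration; real values come from output.outputs[0].text.
sample_texts = ["Positive", "alan@enigma.com"]
results = [check(text) for check, text in zip(checks, sample_texts)]
```

A False entry in results would most likely point to truncated generation or a mismatch between the order of prompts and params.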