Implementation:Vllm project Vllm LLM Generate Constrained

Knowledge Sources	vLLM vLLM Docs
Domains	LLM Inference, Structured Output, Constrained Decoding
Last Updated	2026-02-08 13:00 GMT

Overview

Concrete tool for running text generation under structural constraints, which vLLM enforces by masking logits at each decoding step.

Description

The LLM.generate() method is vLLM's primary offline generation entrypoint. When the sampling_params argument includes a structured_outputs field, the engine automatically:

Detects the constraint type (JSON, regex, grammar, choice).
Selects the appropriate guided decoding backend (xgrammar, outlines, or guidance) based on the engine configuration and constraint type.
Compiles the constraint into a logits processor that produces token-level masks.
At each decoding step, applies the mask to the model's logits before sampling.
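
The masking step can be illustrated with a toy example. This is a minimal sketch of the general technique, not vLLM's implementation: the four-word vocabulary, the logit values, and the helper names `mask_logits` and `greedy_pick` are all hypothetical, and a real backend computes the allowed token set from the compiled constraint at every step.

```python
import math

def mask_logits(logits, allowed_token_ids):
    """Set the logit of every disallowed token to -inf so it can never be sampled."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical toy vocabulary where each token is a whole word.
vocab = ["Positive", "Negative", "Maybe", "Neutral"]
logits = [1.2, 0.3, 2.5, 0.9]   # unconstrained, the model prefers "Maybe"
allowed_ids = [0, 1]            # a choice constraint of ["Positive", "Negative"]

masked = mask_logits(logits, allowed_ids)
picked = vocab[greedy_pick(masked)]
print(picked)  # Positive
```

Because disallowed tokens receive a logit of negative infinity, their probability after softmax is zero, so the same masking works for greedy decoding and for temperature sampling.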

The method accepts a single prompt or a sequence of prompts, and returns a list of RequestOutput objects in the same order. Each RequestOutput contains one or more CompletionOutput objects (depending on n in SamplingParams), with the generated text accessible via .outputs[0].text.

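The output conforms to the specified constraint, provided generation runs to completion. For JSON constraints, the output is valid JSON matching the schema. For regex constraints, the output matches the pattern. For choice constraints, the output is exactly one of the listed strings. If max_tokens or the model's context limit stops generation early, the text may be only a valid prefix of a conforming output (for example, truncated JSON).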
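
Downstream code may still want a defensive check on the returned text before using it. The following is a minimal sketch that uses only the Python standard library; the helper names `check_json_keys`, `check_regex`, and `check_choice` are hypothetical and not part of vLLM, and the JSON check only tests for required keys rather than validating a full schema.

```python
import json
import re

def check_json_keys(text, required_keys):
    """True if text parses as a JSON object that contains every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False  # e.g. output truncated before the closing brace
    return isinstance(obj, dict) and all(key in obj for key in required_keys)

def check_regex(text, pattern):
    """True if the whole text matches the pattern, not just a prefix of it."""
    return re.fullmatch(pattern, text) is not None

def check_choice(text, choices):
    """True if text is exactly one of the allowed strings."""
    return text in choices

car_keys = ["brand", "model", "car_type"]
complete = '{"brand": "Toyota", "model": "Supra", "car_type": "coupe"}'
truncated = '{"brand": "Toyota", "model": "Su'

print(check_json_keys(complete, car_keys))
print(check_json_keys(truncated, car_keys))
print(check_regex("alan@enigma.com", r"\w+@\w+\.com"))
print(check_choice("Positive", ["Positive", "Negative"]))
```

In practice the `text` argument would be `outputs[0].outputs[0].text` from a call to `LLM.generate()`.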

Usage

Use this method after initializing the LLM engine and constructing SamplingParams with structured_outputs set. Pass one or more prompts and receive constrained outputs.

Code Reference

Source Location

Repository: vllm
File: vllm/entrypoints/llm.py (lines 396-459)

Signature

def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]:

Import

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

I/O Contract

Inputs

Name	Type	Required	Description
prompts	PromptType \| Sequence[PromptType]	Yes	A single prompt or a sequence of prompts (strings, token ID lists, or dict-based prompts) to generate completions for
sampling_params	SamplingParams \| Sequence[SamplingParams] \| None	No (default: engine defaults)	Sampling parameters including `structured_outputs` for constrained generation; a single instance applies to all prompts, while a sequence is paired one-to-one with the prompts
use_tqdm	bool \| Callable[..., tqdm]	No (default: True)	Whether to display a progress bar during generation; a callable can be passed to construct a custom progress bar
lora_request	list[LoRARequest] \| LoRARequest \| None	No (default: None)	LoRA adapter request(s) for generation
priority	list[int] \| None	No (default: None)	Priority values for each prompt (only with priority scheduling)
tokenization_kwargs	dict[str, Any] \| None	No (default: None)	Overrides for `tokenizer.encode()`
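
The pairing rule for `sampling_params` (one instance shared by every prompt, or a sequence matched one-to-one) can be sketched as follows. This is an illustration of the contract only, not vLLM's internal code: `pair_params` is a hypothetical helper, plain dicts stand in for `SamplingParams` objects, and raising `ValueError` on a length mismatch is an assumption about how such a mismatch would be handled.

```python
def pair_params(prompts, params):
    """Pair each prompt with its sampling parameters.

    A single params object (or None) is shared by all prompts;
    a list or tuple must have exactly one entry per prompt.
    """
    # Treat a lone string or dict prompt as a batch of one.
    if isinstance(prompts, (str, dict)):
        prompts = [prompts]

    if isinstance(params, (list, tuple)):
        if len(params) != len(prompts):
            raise ValueError(
                f"got {len(params)} params for {len(prompts)} prompts"
            )
        return list(zip(prompts, params))

    # Single instance (or None): broadcast to every prompt.
    return [(prompt, params) for prompt in prompts]

shared = pair_params(["Classify: A", "Classify: B"], {"max_tokens": 50})
per_prompt = pair_params(
    ["Classify: A", "Write an email"],
    [{"choice": ["Positive", "Negative"]}, {"regex": r"\w+@\w+\.com"}],
)
print(shared)
print(per_prompt)
```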

Outputs

Name	Type	Description
results	`list[RequestOutput]`	List of `RequestOutput` objects, one per input prompt, in the same order. Each exposes the generated text as `.outputs[i].text`, which conforms to the constraint when generation runs to completion.

Usage Examples

JSON-Constrained Generation

from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: str

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

structured = StructuredOutputsParams(json=CarDescription.model_json_schema())
sampling_params = SamplingParams(
    structured_outputs=structured,
    max_tokens=50,
)

outputs = llm.generate(
    "Generate a JSON describing the most iconic car from the 90s",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
# e.g., '{"brand": "Toyota", "model": "Supra", "car_type": "coupe"}'

Choice-Constrained Classification

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

structured = StructuredOutputsParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(structured_outputs=structured)

outputs = llm.generate(
    "Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
# e.g., "Positive"

Batch Generation with Different Constraints

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

prompts = [
    "Classify: I love this!",
    "Generate an email for Alan Turing at Enigma",
]
params = [
    SamplingParams(
        structured_outputs=StructuredOutputsParams(
            choice=["Positive", "Negative"]
        ),
    ),
    SamplingParams(
        structured_outputs=StructuredOutputsParams(
            regex=r"\w+@\w+\.com"
        ),
        max_tokens=50,
    ),
]

outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Constrained_Generation
