Implementation:Vllm project Vllm LLM Generate Constrained

From Leeroopedia


Knowledge Sources
Domains LLM Inference, Structured Output, Constrained Decoding
Last Updated 2026-02-08 13:00 GMT

Overview

Concrete tool for running text generation under structural constraints, which vLLM enforces by masking logits during decoding.

Description

The LLM.generate() method is vLLM's primary offline generation entrypoint. When the sampling_params argument includes a structured_outputs field, the engine automatically:

  1. Detects the constraint type (JSON, regex, grammar, choice).
  2. Selects the appropriate guided decoding backend (xgrammar, outlines, or guidance) based on the engine configuration and constraint type.
  3. Compiles the constraint into a logits processor that produces token-level masks.
  4. At each decoding step, applies the mask to the model's logits before sampling.
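The masking in step 4 can be illustrated with a toy example. This is a minimal sketch in plain Python, not vLLM's internal code: the vocabulary, the logit values, and the helper names `apply_token_mask` and `greedy_pick` are all made up for illustration, and greedy selection stands in for whatever sampling the engine actually performs.

```python
import math

def apply_token_mask(logits, allowed_ids):
    """Return a copy of logits with every disallowed token set to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Return the index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and raw model scores (hypothetical numbers).
vocab = ["Positive", "Negative", "Maybe", "<eos>"]
logits = [1.2, 0.4, 3.5, 0.1]

# Assume a choice constraint of ["Positive", "Negative"], so only
# token ids 0 and 1 are allowed at this decoding step.
allowed_ids = [0, 1]

unconstrained = vocab[greedy_pick(logits)]
masked = apply_token_mask(logits, allowed_ids)
chosen = vocab[greedy_pick(masked)]

print(unconstrained)  # the model's unconstrained favorite
print(chosen)         # the best token among the allowed ones
```

Without the mask the highest-scoring token is "Maybe"; with the mask applied, only the allowed tokens remain candidates, so "Positive" is selected.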

The method accepts a single prompt or a sequence of prompts, and returns a list of RequestOutput objects in the same order. Each RequestOutput contains one or more CompletionOutput objects (depending on n in SamplingParams), with the generated text accessible via .outputs[0].text.

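The output conforms to the specified constraint as long as generation runs to completion. For JSON constraints, the output is valid JSON matching the schema. For regex constraints, the output matches the pattern. For choice constraints, the output is exactly one of the listed strings. If generation stops early, for example because max_tokens is reached, the text may be only a prefix of a valid output (such as truncated JSON), so size max_tokens and max_model_len with the expected output length in mind.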
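Downstream code may still want a cheap sanity check on the returned text, particularly when a token limit could cut generation short. The following is a minimal sketch using only the standard library; `conforms` is a hypothetical helper, not part of vLLM, and its JSON branch only checks that the text parses to an object containing the given keys rather than validating a full JSON schema.

```python
import json
import re

def conforms(text, *, choice=None, regex=None, json_required_keys=None):
    """Check one generated string against a single constraint type."""
    if choice is not None:
        # Choice: the text must be exactly one of the listed strings.
        return text in choice
    if regex is not None:
        # Regex: the whole text must match, not just a substring.
        return re.fullmatch(regex, text) is not None
    if json_required_keys is not None:
        # JSON: must parse to an object that has all required keys.
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and set(json_required_keys).issubset(obj)
    raise ValueError("no constraint given")

keys = ["brand", "model", "car_type"]
complete = '{"brand": "Toyota", "model": "Supra", "car_type": "coupe"}'
truncated = '{"brand": "Toyota", "model": "Su'  # cut off by a token limit

print(conforms("Positive", choice=["Positive", "Negative"]))
print(conforms("alan@enigma.com", regex=r"\w+@\w+\.com"))
print(conforms(complete, json_required_keys=keys))
print(conforms(truncated, json_required_keys=keys))
```

The first three checks pass and the last one fails, which is the signal to raise the token limit or retry the request.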

Usage

Use this method after initializing the LLM engine and constructing SamplingParams with structured_outputs set. Pass one or more prompts and receive constrained outputs.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py (lines 396-459)

Signature

def generate(
    self,
    prompts: PromptType | Sequence[PromptType],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    *,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: list[LoRARequest] | LoRARequest | None = None,
    priority: list[int] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]:

Import

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

I/O Contract

Inputs

Name Type Required Description
prompts PromptType | Sequence[PromptType] Yes One or more prompts (strings, token ID lists, or dict-based prompts) to generate completions for
sampling_params SamplingParams | Sequence[SamplingParams] | None No (default: engine defaults) Sampling parameters including structured_outputs for constrained generation; a single instance applies to all prompts, or a list pairs one-to-one with prompts
use_tqdm bool | Callable[..., tqdm] No (default: True) Whether to display a progress bar during generation; a callable can be passed in place of the default tqdm
lora_request list[LoRARequest] | LoRARequest | None No (default: None) LoRA adapter request(s) for generation
priority list[int] | None No (default: None) Priority values, one per prompt (only with priority scheduling)
tokenization_kwargs dict[str, Any] | None No (default: None) Overrides for tokenizer.encode()
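The pairing rule for sampling_params (one instance shared by all prompts, or a list matched one-to-one) can be sketched as follows. This is an illustration of the contract only, not vLLM's implementation: `pair_params` is a hypothetical helper, plain dicts stand in for SamplingParams objects, and raising ValueError on a length mismatch is this sketch's choice.

```python
from collections.abc import Sequence

def pair_params(prompts, sampling_params):
    """Hypothetical helper: pair each prompt with its sampling parameters."""
    # A single string prompt is treated as a batch of one.
    if isinstance(prompts, str):
        prompts = [prompts]
    # A list or tuple of parameters must line up one-to-one with the prompts.
    if isinstance(sampling_params, Sequence):
        if len(sampling_params) != len(prompts):
            raise ValueError(
                f"got {len(prompts)} prompts but {len(sampling_params)} parameter sets"
            )
        return list(zip(prompts, sampling_params))
    # Otherwise the single instance is shared by every prompt.
    return [(prompt, sampling_params) for prompt in prompts]

prompts = ["Classify: I love this!", "Generate an email for Alan Turing at Enigma"]
shared = {"max_tokens": 50}                       # stand-in for one SamplingParams
per_prompt = [{"choice": True}, {"regex": True}]  # stand-ins, one per prompt

print(pair_params(prompts, shared))
print(pair_params(prompts, per_prompt))
```

With `shared`, both prompts receive the same parameters; with `per_prompt`, the first prompt is paired with the first entry and the second with the second.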

Outputs

Name Type Description
results list[RequestOutput] List of RequestOutput objects, one per input prompt, in the same order. Each contains .outputs[i].text with the generated text guaranteed to match the constraint.

Usage Examples

JSON-Constrained Generation

from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: str

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

structured = StructuredOutputsParams(json=CarDescription.model_json_schema())
sampling_params = SamplingParams(
    structured_outputs=structured,
    max_tokens=50,
)

outputs = llm.generate(
    "Generate a JSON describing the most iconic car from the 90s",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
# e.g., '{"brand": "Toyota", "model": "Supra", "car_type": "coupe"}'

Choice-Constrained Classification

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

structured = StructuredOutputsParams(choice=["Positive", "Negative"])
sampling_params = SamplingParams(structured_outputs=structured)

outputs = llm.generate(
    "Classify this sentiment: vLLM is wonderful!",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
# e.g., Positive

Batch Generation with Different Constraints

from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

prompts = [
    "Classify: I love this!",
    "Generate an email for Alan Turing at Enigma",
]
params = [
    SamplingParams(
        structured_outputs=StructuredOutputsParams(
            choice=["Positive", "Negative"]
        ),
    ),
    SamplingParams(
        structured_outputs=StructuredOutputsParams(
            regex=r"\w+@\w+\.com"
        ),
        max_tokens=50,
    ),
]

outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)

Related Pages

Implements Principle
