Implementation:Vllm project Vllm LLM Init Structured

Knowledge Sources	vLLM, vLLM Docs
Domains	LLM Inference, Structured Output, Engine Initialization
Last Updated	2026-02-08 13:00 GMT

Overview

The LLM constructor, provided by vLLM, initializes the inference engine with a language model configured for structured output generation.

Description

The LLM class is vLLM's primary offline inference entrypoint. Its constructor loads the model, initializes the tokenizer, allocates GPU memory (including KV cache), and configures the guided decoding backend. For structured output workflows, the key considerations are:

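Model selection: Instruction-tuned models (e.g., "Qwen/Qwen2.5-3B-Instruct") are strongly recommended. The structural constraints guarantee only the output format; an instruction-tuned model is more likely to fill that format with semantically meaningful content.
max_model_len: Controls the maximum sequence length, which must cover the prompt plus the generated output. For structured outputs, set it large enough for the complete constrained output, since output truncated at the length limit may be structurally incomplete (e.g., an unclosed JSON object).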
structured_outputs_config: An optional StructuredOutputsConfig (or dict) that sets the guided decoding backend (e.g., "xgrammar", "outlines", "auto") and engine-level defaults for fallback, whitespace, and additional properties handling.
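
The max_model_len consideration above comes down to simple arithmetic. The following is a minimal sketch, not part of vLLM: the helper name required_max_model_len and the safety margin of 16 tokens are assumptions made for illustration, and the prompt length would normally come from the model's tokenizer.

```python
def required_max_model_len(
    prompt_tokens: int,
    max_output_tokens: int,
    margin: int = 16,  # assumed headroom for chat-template and special tokens
) -> int:
    """Return a max_model_len that fits the prompt, the output budget, and a margin."""
    if prompt_tokens < 0 or max_output_tokens <= 0 or margin < 0:
        raise ValueError("token counts must be non-negative and the output budget positive")
    return prompt_tokens + max_output_tokens + margin


# Example: a 120-token prompt and up to 256 tokens of constrained output.
budget = required_max_model_len(prompt_tokens=120, max_output_tokens=256)
print(budget)
```

The resulting value can be passed as max_model_len when constructing the engine; a value larger than the model's own context window may be rejected by the engine.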

The constructor delegates to EngineArgs and LLMEngine.from_engine_args() to build the full engine pipeline.

Usage

Use this class at the start of any offline structured output workflow. Instantiate once, then call .generate() with different prompts and constraints.
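
The instantiate-once pattern can be sketched as follows. This assumes a vLLM release that exposes StructuredOutputsParams in vllm.sampling_params and accepts it through the structured_outputs argument of SamplingParams; the schema, choices, and prompts are illustrative only. The vLLM imports sit inside run_demo() so that the constraint definitions load even where vLLM is not installed.

```python
# Illustrative constraints (not taken from the vLLM source).
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}
SENTIMENT_CHOICES = ["Positive", "Negative"]


def run_demo() -> None:
    """Build one engine and reuse it for two different structural constraints."""
    # Lazy imports: these require vLLM and a supported accelerator.
    from vllm import LLM, SamplingParams
    from vllm.sampling_params import StructuredOutputsParams

    # The expensive step (weights, tokenizer, KV cache) happens once.
    llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=512)

    requests = [
        (
            "Classify this sentiment: vLLM is wonderful!",
            SamplingParams(
                structured_outputs=StructuredOutputsParams(choice=SENTIMENT_CHOICES)
            ),
        ),
        (
            "Generate a JSON object describing a fictional person.",
            SamplingParams(
                max_tokens=128,
                structured_outputs=StructuredOutputsParams(json=PERSON_SCHEMA),
            ),
        ),
    ]
    # Each call reuses the same engine with a different constraint.
    for prompt, params in requests:
        outputs = llm.generate(prompt, sampling_params=params)
        print(outputs[0].outputs[0].text)
```

Calling run_demo() on a machine with vLLM installed runs both requests against the single LLM instance.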

Code Reference

Source Location

Repository: vllm
File: vllm/entrypoints/llm.py (lines 199-364)

Signature

class LLM:
    def __init__(
        self,
        model: str,
        *,
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer: str | None = None,
        tokenizer_mode: TokenizerMode | str = "auto",
        skip_tokenizer_init: bool = False,
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        revision: str | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        swap_space: float = 4,
        enforce_eager: bool = False,
        max_model_len: int | None = None,  # shown for reference; forwarded through **kwargs to EngineArgs
        structured_outputs_config: dict[str, Any]
            | StructuredOutputsConfig
            | None = None,
        **kwargs: Any,
    ) -> None:

Import

from vllm import LLM

I/O Contract

Inputs

Name	Type	Required	Description
model	`str`	Yes	HuggingFace model name or path (e.g., `"Qwen/Qwen2.5-3B-Instruct"`); instruction-tuned models recommended for structured output
max_model_len	`int | None`	No (default: None, auto-detected)	Maximum sequence length; set large enough to accommodate prompt plus full structured output
structured_outputs_config	`dict[str, Any] | StructuredOutputsConfig | None`	No (default: None)	Engine-level structured output configuration; controls backend selection (`"auto"`, `"xgrammar"`, `"outlines"`), fallback behavior, and whitespace handling
tensor_parallel_size	`int`	No (default: 1)	Number of GPUs for tensor parallelism
dtype	`str`	No (default: "auto")	Model weight data type (`"float16"`, `"bfloat16"`, `"float32"`, or `"auto"`)
gpu_memory_utilization	`float`	No (default: 0.9)	Fraction of GPU memory to allocate for model, KV cache, and activations
trust_remote_code	`bool`	No (default: False)	Whether to trust remote code from HuggingFace
seed	`int`	No (default: 0)	Random seed for reproducibility
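
As a hedged illustration of the inputs above, the sketch below collects constructor arguments into a dict and checks the ranges the table implies before the engine is built. The helper build_llm_kwargs is hypothetical and not part of vLLM; the checks are assumptions drawn from the descriptions (a positive length, at least one GPU, a memory fraction between 0 and 1).

```python
from typing import Any, Optional


def build_llm_kwargs(
    model: str,
    *,
    max_model_len: Optional[int] = None,
    backend: str = "auto",
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.9,
    seed: int = 0,
) -> dict[str, Any]:
    """Validate common inputs and return keyword arguments for the LLM constructor."""
    if not model:
        raise ValueError("model must be a non-empty name or path")
    if max_model_len is not None and max_model_len <= 0:
        raise ValueError("max_model_len must be positive when set")
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")

    kwargs: dict[str, Any] = {
        "model": model,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
        "seed": seed,
        "structured_outputs_config": {"backend": backend},
    }
    # Leave max_model_len out entirely so the engine can auto-detect it.
    if max_model_len is not None:
        kwargs["max_model_len"] = max_model_len
    return kwargs


kwargs = build_llm_kwargs(
    "Qwen/Qwen2.5-3B-Instruct", max_model_len=512, backend="xgrammar"
)
print(kwargs)
```

The returned dict can then be unpacked into the constructor as LLM(**kwargs).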

Outputs

Name	Type	Description
LLM instance	`LLM`	A fully initialized inference engine ready to serve `.generate()` calls with structural constraints

Usage Examples

Basic Initialization for Structured Output

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)

Initialization with Explicit Backend

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=512,
    structured_outputs_config={"backend": "xgrammar"},
)

Initialization with Full StructuredOutputsConfig

from vllm import LLM
from vllm.config import StructuredOutputsConfig

config = StructuredOutputsConfig(
    backend="auto",
    disable_fallback=False,
    disable_any_whitespace=True,
)
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=512,
    structured_outputs_config=config,
)

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Structured_Output_Engine_Initialization
