Implementation:Vllm project Vllm LLM Init Structured

From Leeroopedia


Knowledge Sources
Domains LLM Inference, Structured Output, Engine Initialization
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for initializing the inference engine with a language model configured for structured output generation.

Description

The LLM class is vLLM's primary offline inference entrypoint. Its constructor loads the model, initializes the tokenizer, allocates GPU memory (including KV cache), and configures the guided decoding backend. For structured output workflows, the key considerations are:

  • Model selection: Instruction-tuned models (e.g., "Qwen/Qwen2.5-3B-Instruct") are strongly recommended. They produce semantically meaningful content within the structural constraints.
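  • max_model_len: Controls the maximum sequence length, counted in tokens across the prompt and the generated output together. For structured outputs, set it large enough for the full constrained output; generation that hits the limit stops mid-structure and can leave, for example, unterminated JSON.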
  • structured_outputs_config: An optional StructuredOutputsConfig (or dict) that sets the guided decoding backend (e.g., "xgrammar", "outlines", "auto") and engine-level defaults for fallback, whitespace, and additional properties handling.
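The max_model_len consideration above amounts to a token budget: the prompt and the complete constrained output must fit together. A minimal sketch of that arithmetic, assuming the token counts are already known (for instance from the model's tokenizer). The helper name required_model_len and its headroom parameter are hypothetical and not part of vLLM.

```python
def required_model_len(prompt_tokens: int, max_output_tokens: int, headroom: int = 0) -> int:
    """Return the smallest max_model_len that fits the prompt plus the full output.

    Hypothetical helper for illustration; vLLM does not provide it.
    """
    if prompt_tokens < 0 or max_output_tokens < 1 or headroom < 0:
        raise ValueError("token counts must be non-negative and output must be at least 1")
    # The sequence length covers prompt and generated tokens together.
    return prompt_tokens + max_output_tokens + headroom


# Example: a 180-token prompt, up to 256 generated tokens, 16 tokens of slack.
budget = required_model_len(prompt_tokens=180, max_output_tokens=256, headroom=16)
print(budget)  # 452
```

The resulting value could then be passed as max_model_len when constructing the LLM, as long as it stays within the context length the model supports.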

The constructor delegates to EngineArgs and LLMEngine.from_engine_args() to build the full engine pipeline.

Usage

Use this class at the start of any offline structured output workflow. Instantiate once, then call .generate() with different prompts and constraints.
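The instantiate-once pattern described above might look like the following sketch. It assumes a vLLM release in which per-request constraints are passed through SamplingParams(structured_outputs=StructuredOutputsParams(...)); older releases used a different parameter name, so check the installed version. The REQUESTS list and the run_all helper are hypothetical names introduced for illustration.

```python
from typing import Any

# Hypothetical workload: each entry pairs a prompt with exactly one constraint.
REQUESTS: list[tuple[str, dict[str, Any]]] = [
    ("Classify the sentiment: 'vLLM is fast.'", {"choice": ["positive", "negative"]}),
    ("Generate an email address for Alan Turing.", {"regex": r"\w+@\w+\.com"}),
]


def run_all(llm: Any) -> list[str]:
    """Run every request against one already-initialized LLM instance."""
    # Imported lazily so this sketch can be loaded without vLLM installed.
    # Assumption: this import path matches the installed vLLM version.
    from vllm.sampling_params import SamplingParams, StructuredOutputsParams

    texts = []
    for prompt, constraint in REQUESTS:
        params = SamplingParams(
            max_tokens=50,
            structured_outputs=StructuredOutputsParams(**constraint),
        )
        outputs = llm.generate(prompt, sampling_params=params)
        texts.append(outputs[0].outputs[0].text)
    return texts
```

With an engine created as in the examples on this page, run_all(llm) would reuse the loaded weights and KV cache allocation for every request, so only the constraint changes between calls.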

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py (lines 199-364)

Signature

class LLM:
    def __init__(
        self,
        model: str,
        *,
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer: str | None = None,
        tokenizer_mode: TokenizerMode | str = "auto",
        skip_tokenizer_init: bool = False,
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        revision: str | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        swap_space: float = 4,
        enforce_eager: bool = False,
        max_model_len: int | None = None,  # via **kwargs / EngineArgs
        structured_outputs_config: dict[str, Any]
            | StructuredOutputsConfig
            | None = None,
        **kwargs: Any,
    ) -> None:

Import

from vllm import LLM

I/O Contract

Inputs

Name Type Required Description
model str Yes HuggingFace model name or path (e.g., "Qwen/Qwen2.5-3B-Instruct"); instruction-tuned models recommended for structured output
max_model_len int | None No (default: None, auto-detected) Maximum sequence length; set large enough to accommodate prompt plus full structured output
structured_outputs_config dict[str, Any] | StructuredOutputsConfig | None No (default: None) Engine-level structured output configuration; controls backend selection ("auto", "xgrammar", "outlines"), fallback behavior, and whitespace handling
tensor_parallel_size int No (default: 1) Number of GPUs for tensor parallelism
dtype str No (default: "auto") Model weight data type ("float16", "bfloat16", "float32", or "auto")
gpu_memory_utilization float No (default: 0.9) Fraction of GPU memory to allocate for model, KV cache, and activations
trust_remote_code bool No (default: False) Whether to trust remote code from HuggingFace
seed int No (default: 0) Random seed for reproducibility
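Because the constructor loads weights and allocates GPU memory, it can be worth catching obviously invalid arguments before calling it. The sketch below is one way to do that, assuming the ranges implied by the table above; the helper check_llm_kwargs and the ALLOWED_DTYPES set are hypothetical, mirror only the values listed on this page, and do not reproduce vLLM's own validation.

```python
from typing import Any

# Assumption: only the dtype strings listed in the table above.
ALLOWED_DTYPES = {"auto", "float16", "bfloat16", "float32"}


def check_llm_kwargs(kwargs: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the keyword arguments (empty if none)."""
    problems = []
    model = kwargs.get("model")
    if not isinstance(model, str) or not model:
        problems.append("model must be a non-empty string")
    tp = kwargs.get("tensor_parallel_size", 1)
    if not isinstance(tp, int) or tp < 1:
        problems.append("tensor_parallel_size must be an integer >= 1")
    util = kwargs.get("gpu_memory_utilization", 0.9)
    if not 0 < util <= 1:
        problems.append("gpu_memory_utilization must be a fraction in (0, 1]")
    if kwargs.get("dtype", "auto") not in ALLOWED_DTYPES:
        problems.append("dtype is not one of the values listed in the table")
    return problems


print(check_llm_kwargs({"model": "Qwen/Qwen2.5-3B-Instruct", "max_model_len": 512}))  # []
```

An empty list means the arguments pass these basic checks; the constructor may still reject them for reasons this sketch does not cover.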

Outputs

Name Type Description
LLM instance LLM A fully initialized inference engine ready to serve .generate() calls with structural constraints

Usage Examples

Basic Initialization for Structured Output

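from vllm import LLM

# max_model_len=100 caps prompt plus output at 100 tokens, enough only for short constrained outputs.
llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)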

Initialization with Explicit Backend

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=512,
    structured_outputs_config={"backend": "xgrammar"},
)

Initialization with Full StructuredOutputsConfig

from vllm import LLM
from vllm.config import StructuredOutputsConfig

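config = StructuredOutputsConfig(
    backend="auto",               # let vLLM choose a backend for each request
    disable_fallback=False,       # allow falling back to another backend on error
    disable_any_whitespace=True,  # emit compact output without optional whitespace
)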
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=512,
    structured_outputs_config=config,
)

Related Pages

Implements Principle
