Implementation:Vllm project Vllm LLM Init Structured
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Structured Output, Engine Initialization |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool for initializing the vLLM inference engine with a language model configured for structured output generation, provided by vLLM.
Description
The LLM class is vLLM's primary offline inference entrypoint. Its constructor loads the model, initializes the tokenizer, allocates GPU memory (including KV cache), and configures the guided decoding backend. For structured output workflows, the key considerations are:
- Model selection: Instruction-tuned models (e.g., "Qwen/Qwen2.5-3B-Instruct") are strongly recommended. They produce semantically meaningful content within the structural constraints.
- max_model_len: Controls the maximum sequence length. Must accommodate both prompt and generated output. For structured outputs, set this large enough for the full constrained output.
- structured_outputs_config: An optional StructuredOutputsConfig (or dict) that sets the guided decoding backend (e.g., "xgrammar", "outlines", "auto") and engine-level defaults for fallback, whitespace, and additional properties handling.
The constructor delegates to EngineArgs and LLMEngine.from_engine_args() to build the full engine pipeline.
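The max_model_len requirement above comes down to simple token arithmetic: the prompt tokens and the generated tokens share one context window. The sketch below illustrates that check in plain Python. The helper names (`fits_context`, `required_model_len`) and the `headroom` margin are hypothetical and not part of vLLM; the token counts are assumed to come from the model's tokenizer.

```python
def fits_context(prompt_tokens: int, max_output_tokens: int, max_model_len: int) -> bool:
    """Return True if the prompt plus the generated output fits in the context window."""
    return prompt_tokens + max_output_tokens <= max_model_len


def required_model_len(prompt_tokens: int, max_output_tokens: int, headroom: int = 16) -> int:
    """Smallest max_model_len that holds the request plus a small safety margin.

    `headroom` is an illustrative margin, not a vLLM setting.
    """
    return prompt_tokens + max_output_tokens + headroom


if __name__ == "__main__":
    # A 60-token prompt leaves room for at most 40 output tokens when max_model_len=100.
    print(fits_context(60, 40, 100))
    print(fits_context(60, 41, 100))
    print(required_model_len(200, 256))
```

If a constrained output such as a large JSON object is cut off before the structure is closed, the result may not parse, so it is worth sizing max_model_len from the largest output you expect rather than the typical one.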
Usage
Use this class at the start of any offline structured output workflow. Instantiate once, then call .generate() with different prompts and constraints.
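The "instantiate once, reuse many times" pattern can be sketched as follows. This is an illustration under stated assumptions, not the project's reference code: it assumes the installed vLLM version exposes `StructuredOutputsParams` in `vllm.sampling_params` and accepts it through the `structured_outputs` argument of `SamplingParams` (older releases used a different per-request class). The helper names `build_constraints` and `run_all`, the prompts, and the schema are hypothetical.

```python
from typing import Any


def build_constraints() -> list[tuple[str, dict[str, Any]]]:
    """Pair each prompt with the keyword arguments for its structural constraint."""
    person_schema = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    }
    return [
        ("Classify the sentiment: vLLM is fast!", {"choice": ["Positive", "Negative"]}),
        ("Invent a person and describe them as JSON.", {"json": person_schema}),
        ("Write an email address for Alan Turing.", {"regex": r"\w+@\w+\.com"}),
    ]


def run_all(llm: Any, max_tokens: int = 64) -> list[str]:
    """Reuse one already-initialized LLM for every (prompt, constraint) pair."""
    # Imported lazily so this module can be loaded without vLLM installed.
    # Assumption: these names exist in the installed vLLM version.
    from vllm import SamplingParams
    from vllm.sampling_params import StructuredOutputsParams

    texts = []
    for prompt, constraint in build_constraints():
        params = SamplingParams(
            max_tokens=max_tokens,
            structured_outputs=StructuredOutputsParams(**constraint),
        )
        outputs = llm.generate(prompt, sampling_params=params)
        # Each request yields one RequestOutput; take its first completion.
        texts.append(outputs[0].outputs[0].text)
    return texts
```

Calling `run_all(llm)` with an `LLM` instance created as described on this page would run all three requests against the same loaded model, so the model load and KV cache allocation happen only once.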
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py (lines 199-364)
Signature
class LLM:
    def __init__(
        self,
        model: str,
        *,
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer: str | None = None,
        tokenizer_mode: TokenizerMode | str = "auto",
        skip_tokenizer_init: bool = False,
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        revision: str | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        swap_space: float = 4,
        enforce_eager: bool = False,
        max_model_len: int | None = None,  # via **kwargs / EngineArgs
        structured_outputs_config: dict[str, Any]
        | StructuredOutputsConfig
        | None = None,
        **kwargs: Any,
    ) -> None:
Import
from vllm import LLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model name or path (e.g., "Qwen/Qwen2.5-3B-Instruct"); instruction-tuned models recommended for structured output |
| max_model_len | int or None | No (default: None, auto-detected) | Maximum sequence length; set large enough to accommodate prompt plus full structured output |
| structured_outputs_config | dict, StructuredOutputsConfig, or None | No (default: None) | Engine-level structured output configuration; controls backend selection ("auto", "xgrammar", "outlines"), fallback behavior, and whitespace handling |
| tensor_parallel_size | int | No (default: 1) | Number of GPUs for tensor parallelism |
| dtype | str | No (default: "auto") | Model weight data type ("float16", "bfloat16", "float32", or "auto") |
| gpu_memory_utilization | float | No (default: 0.9) | Fraction of GPU memory to allocate for model, KV cache, and activations |
| trust_remote_code | bool | No (default: False) | Whether to trust remote code from HuggingFace |
| seed | int | No (default: 0) | Random seed for reproducibility |
Outputs
| Name | Type | Description |
|---|---|---|
| LLM instance | LLM | A fully initialized inference engine ready to serve .generate() calls with structural constraints |
Usage Examples
Basic Initialization for Structured Output
from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=100)
Initialization with Explicit Backend
from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=512,
    structured_outputs_config={"backend": "xgrammar"},
)
Initialization with Full StructuredOutputsConfig
from vllm import LLM
from vllm.config import StructuredOutputsConfig
config = StructuredOutputsConfig(
    backend="auto",
    disable_fallback=False,
    disable_any_whitespace=True,
)
llm = LLM(
    model="Qwen/Qwen2.5-3B-Instruct",
    max_model_len=512,
    structured_outputs_config=config,
)