Implementation: vLLM LLM Init (vLLM project)
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, Systems Engineering, GPU Computing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool for initializing vLLM's offline inference engine via the LLM class constructor.
Description
The LLM class is the primary entry point for offline (batch) inference in vLLM. Its constructor loads a HuggingFace-compatible language model, initializes the tokenizer, allocates the KV cache via PagedAttention, and optionally captures CUDA graphs for optimized execution. The resulting object exposes generate() and chat() methods for running inference.
Internally, the constructor builds an EngineArgs object from the provided parameters, creates an LLMEngine instance via LLMEngine.from_engine_args(), and caches model configuration metadata. The engine handles all complexity of distributed execution, memory management, and batching.
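The construction flow described above can be sketched in simplified pure Python. The names EngineArgs, LLMEngine, and from_engine_args() come from the description; the fields and logic below are illustrative stand-ins only, not vLLM's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative stand-ins for vLLM's real EngineArgs / LLMEngine classes;
# the real ones load weights, allocate the KV cache, and capture CUDA graphs.
@dataclass
class EngineArgs:
    model: str
    tensor_parallel_size: int = 1
    dtype: str = "auto"
    gpu_memory_utilization: float = 0.9
    extra: dict[str, Any] = field(default_factory=dict)

class LLMEngine:
    def __init__(self, args: EngineArgs) -> None:
        self.args = args  # the real engine would initialize workers here

    @classmethod
    def from_engine_args(cls, args: EngineArgs) -> "LLMEngine":
        return cls(args)

class LLM:
    def __init__(self, model: str, **kwargs: Any) -> None:
        # 1. Collect known parameters into an EngineArgs object,
        #    forwarding unrecognized keyword arguments (as vLLM's **kwargs does).
        known = {
            k: kwargs.pop(k)
            for k in ("tensor_parallel_size", "dtype", "gpu_memory_utilization")
            if k in kwargs
        }
        args = EngineArgs(model=model, extra=kwargs, **known)
        # 2. Build the engine and cache configuration metadata for later calls.
        self.llm_engine = LLMEngine.from_engine_args(args)
        self.model_config = args

llm = LLM("meta-llama/Llama-3.1-8B-Instruct",
          tensor_parallel_size=2, max_model_len=8192)
print(llm.model_config.tensor_parallel_size)       # 2
print(llm.model_config.extra["max_model_len"])     # 8192
```

The key design point this sketch captures is that the constructor itself holds no engine logic: it only normalizes arguments into a config object and delegates everything else to the engine factory.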
Usage
Import and instantiate LLM once at the start of your script. Pass the model name or path as the first argument, along with any hardware configuration parameters. Reuse the instance for all inference calls; constructing a new LLM reloads the model weights and reallocates the KV cache.
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py
- Lines: 101-364
Signature
class LLM:
def __init__(
self,
model: str,
*,
runner: RunnerOption = "auto",
convert: ConvertOption = "auto",
tokenizer: str | None = None,
tokenizer_mode: TokenizerMode | str = "auto",
skip_tokenizer_init: bool = False,
trust_remote_code: bool = False,
allowed_local_media_path: str = "",
allowed_media_domains: list[str] | None = None,
tensor_parallel_size: int = 1,
dtype: ModelDType = "auto",
quantization: QuantizationMethods | None = None,
revision: str | None = None,
tokenizer_revision: str | None = None,
seed: int = 0,
gpu_memory_utilization: float = 0.9,
swap_space: float = 4,
cpu_offload_gb: float = 0,
enforce_eager: bool = False,
enable_return_routed_experts: bool = False,
disable_custom_all_reduce: bool = False,
hf_token: bool | str | None = None,
hf_overrides: HfOverrides | None = None,
mm_processor_kwargs: dict[str, Any] | None = None,
pooler_config: PoolerConfig | None = None,
structured_outputs_config: dict[str, Any]
| StructuredOutputsConfig
| None = None,
profiler_config: dict[str, Any] | ProfilerConfig | None = None,
attention_config: dict[str, Any] | AttentionConfig | None = None,
kv_cache_memory_bytes: int | None = None,
compilation_config: int | dict[str, Any] | CompilationConfig | None = None,
logits_processors: list[str | type[LogitsProcessor]] | None = None,
**kwargs: Any,
) -> None
Import
from vllm import LLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model name or local path (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
| tokenizer | str or None | No (default: None) | Custom tokenizer name or path. Defaults to the model's tokenizer |
| tokenizer_mode | str | No (default: "auto") | Tokenizer mode: "auto" uses fast tokenizer if available, "slow" forces the slow tokenizer |
| trust_remote_code | bool | No (default: False) | Whether to trust and execute remote code from the model repository |
| tensor_parallel_size | int | No (default: 1) | Number of GPUs for tensor-parallel distributed execution |
| dtype | str | No (default: "auto") | Data type for weights and activations: "auto", "float32", "float16", "bfloat16" |
| quantization | str or None | No (default: None) | Quantization method: "awq", "gptq", "fp8", or None |
| gpu_memory_utilization | float | No (default: 0.9) | Fraction of GPU memory (0-1) reserved for model weights, activations, and KV cache |
| enforce_eager | bool | No (default: False) | If True, disables CUDA graph capture and uses eager execution only |
| seed | int | No (default: 0) | Seed for the random number generator |
| swap_space | float | No (default: 4) | CPU swap space per GPU in GiB |
| kv_cache_memory_bytes | int or None | No (default: None) | Explicit KV cache size in bytes per GPU. Overrides gpu_memory_utilization when set |
| **kwargs | Any | No | Additional arguments forwarded to EngineArgs (e.g., max_model_len, pipeline_parallel_size) |
Outputs
| Name | Type | Description |
|---|---|---|
| LLM instance | LLM | A fully initialized inference engine ready for generate() and chat() calls |
Usage Examples
Basic Initialization
from vllm import LLM
# Load a model with default settings
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
Multi-GPU with Tensor Parallelism
from vllm import LLM
# Distribute model across 4 GPUs
llm = LLM(
model="meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
gpu_memory_utilization=0.85,
)
Quantized Model Loading
from vllm import LLM
# Load an AWQ-quantized model
llm = LLM(
model="TheBloke/Llama-2-7B-Chat-AWQ",
quantization="awq",
dtype="float16",
)
Debugging with Eager Mode
from vllm import LLM
# Disable CUDA graphs for debugging
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
enforce_eager=True,
trust_remote_code=True,
)