
Implementation:Vllm project Vllm LLM Init

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Systems Engineering, GPU Computing
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for initializing its offline (batch) inference engine.

Description

The LLM class is the primary entry point for offline (batch) inference in vLLM. Its constructor loads a HuggingFace-compatible language model, initializes the tokenizer, allocates the KV cache via PagedAttention, and optionally captures CUDA graphs for optimized execution. The resulting object exposes generate() and chat() methods for running inference.

Internally, the constructor builds an EngineArgs object from the provided parameters, creates an LLMEngine instance via LLMEngine.from_engine_args(), and caches model configuration metadata. The engine handles all complexity of distributed execution, memory management, and batching.

Usage

Import and instantiate LLM once at the start of your script. Pass the model name or path as the first argument, along with any hardware configuration parameters. Reuse the instance for all inference calls.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py
  • Lines: 101-364

Signature

class LLM:
    def __init__(
        self,
        model: str,
        *,
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer: str | None = None,
        tokenizer_mode: TokenizerMode | str = "auto",
        skip_tokenizer_init: bool = False,
        trust_remote_code: bool = False,
        allowed_local_media_path: str = "",
        allowed_media_domains: list[str] | None = None,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        revision: str | None = None,
        tokenizer_revision: str | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        swap_space: float = 4,
        cpu_offload_gb: float = 0,
        enforce_eager: bool = False,
        enable_return_routed_experts: bool = False,
        disable_custom_all_reduce: bool = False,
        hf_token: bool | str | None = None,
        hf_overrides: HfOverrides | None = None,
        mm_processor_kwargs: dict[str, Any] | None = None,
        pooler_config: PoolerConfig | None = None,
        structured_outputs_config: dict[str, Any]
        | StructuredOutputsConfig
        | None = None,
        profiler_config: dict[str, Any] | ProfilerConfig | None = None,
        attention_config: dict[str, Any] | AttentionConfig | None = None,
        kv_cache_memory_bytes: int | None = None,
        compilation_config: int | dict[str, Any] | CompilationConfig | None = None,
        logits_processors: list[str | type[LogitsProcessor]] | None = None,
        **kwargs: Any,
    ) -> None

Import

from vllm import LLM

I/O Contract

Inputs

Name | Type | Required | Description
model | str | Yes | HuggingFace model name or local path (e.g., "meta-llama/Llama-3.1-8B-Instruct")
tokenizer | str or None | No (default: None) | Custom tokenizer name or path; defaults to the model's tokenizer
tokenizer_mode | str | No (default: "auto") | Tokenizer mode: "auto" uses the fast tokenizer if available, "slow" forces the slow tokenizer
trust_remote_code | bool | No (default: False) | Whether to trust and execute remote code from the model repository
tensor_parallel_size | int | No (default: 1) | Number of GPUs for tensor-parallel distributed execution
dtype | str | No (default: "auto") | Data type for weights and activations: "auto", "float32", "float16", "bfloat16"
quantization | str or None | No (default: None) | Quantization method: "awq", "gptq", "fp8", or None
gpu_memory_utilization | float | No (default: 0.9) | Fraction of GPU memory (0-1) reserved for model weights, activations, and KV cache
enforce_eager | bool | No (default: False) | If True, disables CUDA graph capture and uses eager execution only
seed | int | No (default: 0) | Seed for the random number generator
swap_space | float | No (default: 4) | CPU swap space per GPU in GiB
kv_cache_memory_bytes | int or None | No (default: None) | Explicit KV cache size in bytes per GPU; overrides gpu_memory_utilization when set
**kwargs | Any | No | Additional arguments forwarded to EngineArgs (e.g., max_model_len, pipeline_parallel_size)

Outputs

Name | Type | Description
LLM instance | LLM | A fully initialized inference engine ready for generate() and chat() calls

Usage Examples

Basic Initialization

from vllm import LLM

# Load a model with default settings
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

Multi-GPU with Tensor Parallelism

from vllm import LLM

# Distribute model across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85,
)

Quantized Model Loading

from vllm import LLM

# Load an AWQ-quantized model
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",
    dtype="float16",
)

Debugging with Eager Mode

from vllm import LLM

# Disable CUDA graphs for debugging
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enforce_eager=True,
    trust_remote_code=True,
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
