Implementation: vLLM LLM Init (vLLM project)
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, Systems Engineering, GPU Computing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool for initializing vLLM's offline inference engine via the LLM class constructor.
Description
The LLM class is the primary entry point for offline (batch) inference in vLLM. Its constructor loads a HuggingFace-compatible language model, initializes the tokenizer, allocates the KV cache via PagedAttention, and optionally captures CUDA graphs for optimized execution. The resulting object exposes generate() and chat() methods for running inference.
Internally, the constructor builds an EngineArgs object from the provided parameters, creates an LLMEngine instance via LLMEngine.from_engine_args(), and caches model configuration metadata. The engine handles all complexity of distributed execution, memory management, and batching.
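The construction flow described above can be sketched in simplified pure Python. The names EngineArgs, LLMEngine, and from_engine_args() come from the description; the fields and logic below are illustrative stand-ins only, not vLLM's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative stand-ins for vLLM's real EngineArgs / LLMEngine classes;
# the real ones load weights, allocate the KV cache, and capture CUDA graphs.
@dataclass
class EngineArgs:
    model: str
    tensor_parallel_size: int = 1
    dtype: str = "auto"
    gpu_memory_utilization: float = 0.9
    extra: dict[str, Any] = field(default_factory=dict)

class LLMEngine:
    def __init__(self, args: EngineArgs) -> None:
        self.args = args  # the real engine would initialize workers here

    @classmethod
    def from_engine_args(cls, args: EngineArgs) -> "LLMEngine":
        return cls(args)

class LLM:
    def __init__(self, model: str, **kwargs: Any) -> None:
        # 1. Collect known parameters into an EngineArgs object,
        #    forwarding unrecognized keyword arguments (as vLLM's **kwargs does).
        known = {
            k: kwargs.pop(k)
            for k in ("tensor_parallel_size", "dtype", "gpu_memory_utilization")
            if k in kwargs
        }
        args = EngineArgs(model=model, extra=kwargs, **known)
        # 2. Build the engine and cache configuration metadata for later calls.
        self.llm_engine = LLMEngine.from_engine_args(args)
        self.model_config = args

llm = LLM("meta-llama/Llama-3.1-8B-Instruct",
          tensor_parallel_size=2, max_model_len=8192)
print(llm.model_config.tensor_parallel_size)       # 2
print(llm.model_config.extra["max_model_len"])     # 8192
```

The key design point this sketch captures is that the constructor itself holds no engine logic: it only normalizes arguments into a config object and delegates everything else to the engine factory.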
Usage
Import and instantiate LLM once at the start of your script. Pass the model name or path as the first argument, along with any hardware configuration parameters. Reuse the instance for all inference calls; constructing a new LLM reloads the model weights and reallocates the KV cache.
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py
- Lines: 101-364
Signature
class LLM:
def __init__(
self,
model: str,
*,
runner: RunnerOption = "auto",
convert: ConvertOption = "auto",
tokenizer: str | None = None,
tokenizer_mode: TokenizerMode | str = "auto",
skip_tokenizer_init: bool = False,
trust_remote_code: bool = False,
allowed_local_media_path: str = "",
allowed_media_domains: list[str] | None = None,
tensor_parallel_size: int = 1,
dtype: ModelDType = "auto",
quantization: QuantizationMethods | None = None,
revision: str | None = None,
tokenizer_revision: str | None = None,
seed: int = 0,
gpu_memory_utilization: float = 0.9,
swap_space: float = 4,
cpu_offload_gb: float = 0,
enforce_eager: bool = False,
enable_return_routed_experts: bool = False,
disable_custom_all_reduce: bool = False,
hf_token: bool | str | None = None,
hf_overrides: HfOverrides | None = None,
mm_processor_kwargs: dict[str, Any] | None = None,
pooler_config: PoolerConfig | None = None,
structured_outputs_config: dict[str, Any]
| StructuredOutputsConfig
| None = None,
profiler_config: dict[str, Any] | ProfilerConfig | None = None,
attention_config: dict[str, Any] | AttentionConfig | None = None,
kv_cache_memory_bytes: int | None = None,
compilation_config: int | dict[str, Any] | CompilationConfig | None = None,
logits_processors: list[str | type[LogitsProcessor]] | None = None,
**kwargs: Any,
) -> None
Import
from vllm import LLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model name or local path (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
| tokenizer | str or None | No (default: None) | Custom tokenizer name or path. Defaults to the model's tokenizer |
| tokenizer_mode | str | No (default: "auto") | Tokenizer mode: "auto" uses fast tokenizer if available, "slow" forces the slow tokenizer |
| trust_remote_code | bool | No (default: False) | Whether to trust and execute remote code from the model repository |
| tensor_parallel_size | int | No (default: 1) | Number of GPUs for tensor-parallel distributed execution |
| dtype | str | No (default: "auto") | Data type for weights and activations: "auto", "float32", "float16", "bfloat16" |
| quantization | str or None | No (default: None) | Quantization method: "awq", "gptq", "fp8", or None |
| gpu_memory_utilization | float | No (default: 0.9) | Fraction of GPU memory (0-1) reserved for model weights, activations, and KV cache |
| enforce_eager | bool | No (default: False) | If True, disables CUDA graph capture and uses eager execution only |
| seed | int | No (default: 0) | Seed for the random number generator |
| swap_space | float | No (default: 4) | CPU swap space per GPU in GiB |
| kv_cache_memory_bytes | int or None | No (default: None) | Explicit KV cache size in bytes per GPU. Overrides gpu_memory_utilization when set |
| **kwargs | Any | No | Additional arguments forwarded to EngineArgs (e.g., max_model_len, pipeline_parallel_size) |
Outputs
| Name | Type | Description |
|---|---|---|
| LLM instance | LLM | A fully initialized inference engine ready for generate() and chat() calls |
Usage Examples
Basic Initialization
from vllm import LLM
# Load a model with default settings
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
Multi-GPU with Tensor Parallelism
from vllm import LLM
# Distribute model across 4 GPUs
llm = LLM(
model="meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
gpu_memory_utilization=0.85,
)
Quantized Model Loading
from vllm import LLM
# Load an AWQ-quantized model
llm = LLM(
model="TheBloke/Llama-2-7B-Chat-AWQ",
quantization="awq",
dtype="float16",
)
Debugging with Eager Mode
from vllm import LLM
# Disable CUDA graphs for debugging
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
enforce_eager=True,
trust_remote_code=True,
)