Implementation:Vllm project Vllm LLM Init Multimodal

Knowledge Sources	vLLM vLLM Docs
Domains	LLM Serving, Vision Language Models, GPU Memory Management
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM's LLM class, for initializing the vLLM inference engine with multimodal-specific configuration parameters.

Description

The LLM class is the primary entrypoint for offline (batch) inference in vLLM. When it is used for VLM inference, several multimodal-specific parameters are typically set in addition to the standard model and serving parameters; all of them are optional at the API level. The constructor builds an EngineArgs dataclass, which is then used to initialize the LLMEngine with the model configuration, memory allocation, and multimodal processing pipelines.

Key multimodal parameters:

limit_mm_per_prompt: A dictionary mapping modality names to maximum counts per prompt (e.g., {"image": 1}, {"video": 1}). This is passed through to MultiModalConfig.limit_per_prompt.
mm_processor_kwargs: A dictionary of keyword arguments forwarded to the model's HuggingFace multimodal processor. Model-specific examples include {"num_crops": 16} for Phi-3.5-Vision, {"min_pixels": 784, "max_pixels": 1003520} for Qwen2.5-VL, and {"do_pan_and_scan": True} for Gemma-3.
trust_remote_code: Required by many VLMs (InternVL, Phi-3-Vision, Molmo, etc.) that use custom model code hosted on HuggingFace.
enforce_eager: Disables CUDA graph capture so the model runs in eager mode; required by some VLM architectures (GLM-4v, Gemma3n, Idefics3, SmolVLM).
hf_overrides: Overrides HuggingFace model config fields, used when architecture detection needs correction (e.g., {"architectures": ["DeepseekVLV2ForCausalLM"]} for DeepSeek-VL2).
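
As a rough sketch of how these parameters fit together, the keyword arguments can be collected in a plain dictionary before being handed to LLM. The helper name `build_vlm_kwargs` is hypothetical and not part of vLLM; only the keyword names come from the list above.

```python
from typing import Any, Dict, Optional


def build_vlm_kwargs(
    model: str,
    *,
    limit_mm_per_prompt: Optional[Dict[str, int]] = None,
    mm_processor_kwargs: Optional[Dict[str, Any]] = None,
    trust_remote_code: bool = False,
    enforce_eager: bool = False,
    hf_overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect multimodal-related keyword arguments for LLM(...).

    Hypothetical helper for illustration: it only assembles a dict and
    omits anything the caller did not set, so vLLM's own defaults apply.
    """
    kwargs: Dict[str, Any] = {"model": model}
    if limit_mm_per_prompt is not None:
        for modality, count in limit_mm_per_prompt.items():
            if count < 0:
                raise ValueError(f"limit for {modality!r} must be >= 0, got {count}")
        kwargs["limit_mm_per_prompt"] = dict(limit_mm_per_prompt)
    if mm_processor_kwargs is not None:
        kwargs["mm_processor_kwargs"] = dict(mm_processor_kwargs)
    if hf_overrides is not None:
        kwargs["hf_overrides"] = dict(hf_overrides)
    # Boolean flags are only included when switched on.
    if trust_remote_code:
        kwargs["trust_remote_code"] = True
    if enforce_eager:
        kwargs["enforce_eager"] = True
    return kwargs


phi_kwargs = build_vlm_kwargs(
    "microsoft/Phi-3.5-vision-instruct",
    limit_mm_per_prompt={"image": 1},
    mm_processor_kwargs={"num_crops": 16},
    trust_remote_code=True,
)
# With vLLM installed, the dict can be unpacked directly:
#   from vllm import LLM
#   llm = LLM(**phi_kwargs)
```

Keeping the configuration as data makes it easy to log or compare per-model settings before the comparatively expensive engine start-up.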

Usage

Use LLM initialization with multimodal configuration when:

Setting up offline VLM inference with vLLM.
Configuring memory limits for multimodal serving.
Loading models that require custom trust or execution settings.

Code Reference

Source Location

Repository: vllm
File: vllm/entrypoints/llm.py (lines 199-364), vllm/engine/arg_utils.py (lines 457-480 for multimodal EngineArgs)

Signature

class LLM:
    def __init__(
        self,
        model: str,
        *,
        tokenizer: str | None = None,
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        enforce_eager: bool = False,
        max_model_len: int | None = None,       # via **kwargs -> EngineArgs
        max_num_seqs: int = 256,                 # via **kwargs -> EngineArgs
        limit_mm_per_prompt: dict | None = None, # via **kwargs -> EngineArgs
        mm_processor_kwargs: dict | None = None,
        hf_overrides: HfOverrides | None = None,
        **kwargs: Any,
    ) -> None: ...
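
The `# via **kwargs -> EngineArgs` comments describe a pass-through pattern: arguments the constructor does not name itself are forwarded to the engine-argument dataclass. The following is a minimal sketch of that pattern only. `ToyEngineArgs` and `ToyLLM` are hypothetical stand-ins and do not reproduce vLLM's real classes or their full field set.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToyEngineArgs:
    """Hypothetical stand-in for an engine-argument dataclass."""

    model: str
    trust_remote_code: bool = False
    max_model_len: Optional[int] = None
    max_num_seqs: int = 256
    limit_mm_per_prompt: Dict[str, int] = field(default_factory=dict)


class ToyLLM:
    """Hypothetical stand-in: explicit arguments plus **kwargs forwarding."""

    def __init__(
        self, model: str, *, trust_remote_code: bool = False, **kwargs: Any
    ) -> None:
        # Keyword names the dataclass does not define raise TypeError here.
        self.engine_args = ToyEngineArgs(
            model=model, trust_remote_code=trust_remote_code, **kwargs
        )


toy = ToyLLM(
    "llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)
```

In this toy version a misspelled keyword fails immediately at construction time, because the dataclass rejects field names it does not define.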

Import

from vllm import LLM

I/O Contract

Inputs

Name	Type	Required	Description
model	`str`	Yes	HuggingFace model ID or local path (e.g., `"llava-hf/llava-1.5-7b-hf"`)
limit_mm_per_prompt	`dict[str, int]`	No	Maximum multimodal inputs per prompt per modality (e.g., `{"image": 1}`)
mm_processor_kwargs	`dict[str, Any]`	No	Model-specific processor kwargs (e.g., `{"num_crops": 16}`)
trust_remote_code	`bool`	No	Whether to trust remote model code (default: `False`)
enforce_eager	`bool`	No	Disable CUDA graph capture and run in eager mode (default: `False`)
max_model_len	`int | None`	No	Maximum sequence length including visual tokens (default: model config value)
max_num_seqs	`int`	No	Maximum concurrent sequences (default: `256`; VLMs often use 2-5)
tensor_parallel_size	`int`	No	Number of GPUs for tensor parallelism (default: `1`)
hf_overrides	`HfOverrides | None`	No	Overrides for HuggingFace model config fields
dtype	`str`	No	Data type for model weights (default: `"auto"`; some VLMs need `"bfloat16"` or `"half"`)
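
Because `max_model_len` counts visual tokens as well as text, a quick budget check can help when choosing it together with `limit_mm_per_prompt`. The sketch below is an illustration, not vLLM's own accounting: `fits_context` is a hypothetical helper, and the figure of 576 tokens per image is an assumed value for a LLaVA-1.5-style encoder that should be replaced with the number for the model in use.

```python
from typing import Dict


def fits_context(
    text_tokens: int,
    items_per_prompt: Dict[str, int],
    tokens_per_item: Dict[str, int],
    max_model_len: int,
    max_new_tokens: int = 0,
) -> bool:
    """Return True if text, visual, and generated tokens fit in the context.

    items_per_prompt mirrors limit_mm_per_prompt (modality -> count);
    tokens_per_item is a caller-supplied estimate (modality -> tokens).
    """
    visual_tokens = sum(
        count * tokens_per_item[modality]
        for modality, count in items_per_prompt.items()
    )
    return text_tokens + visual_tokens + max_new_tokens <= max_model_len


# Assumed value: 576 visual tokens per image.
one_image_ok = fits_context(200, {"image": 1}, {"image": 576}, 4096, max_new_tokens=256)
eight_images_ok = fits_context(200, {"image": 8}, {"image": 576}, 4096, max_new_tokens=256)
print(one_image_ok, eight_images_ok)  # True False
```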

Outputs

Name	Type	Description
llm	`LLM`	Initialized LLM instance ready for multimodal generation via `.generate()`
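
Once initialized, the instance accepts prompts that carry multimodal data alongside text. The sketch below assumes the LLaVA-1.5 prompt template and a caller-supplied image path; `build_image_request` and `run_example` are illustrative helpers, not vLLM APIs. vLLM and Pillow are imported lazily, so the request-building part runs without either installed.

```python
from typing import Any, Dict


def build_image_request(question: str, image: Any) -> Dict[str, Any]:
    """Build one prompt dict with a single image attached.

    The prompt layout follows the LLaVA-1.5 chat template (an assumption
    here; other models expect different placeholders).
    """
    return {
        "prompt": f"USER: <image>\n{question}\nASSISTANT:",
        "multi_modal_data": {"image": image},
    }


def run_example(image_path: str) -> str:
    """Run one generation; needs vLLM, Pillow, and a suitable GPU."""
    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        max_model_len=4096,
        limit_mm_per_prompt={"image": 1},
    )
    image = Image.open(image_path).convert("RGB")
    request = build_image_request("What is shown in this image?", image)
    outputs = llm.generate(request, sampling_params=SamplingParams(max_tokens=64))
    # Each result holds one or more completions; take the first one's text.
    return outputs[0].outputs[0].text
```

The number of images attached to a single prompt should stay within the `limit_mm_per_prompt` value given at initialization.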

Usage Examples

Basic LLaVA-1.5 Configuration

from vllm import LLM

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

Phi-3.5-Vision with Processor Kwargs

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
    mm_processor_kwargs={"num_crops": 16},
    limit_mm_per_prompt={"image": 1},
)

Qwen2.5-VL with Pixel Limits

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    max_model_len=4096,
    max_num_seqs=5,
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    limit_mm_per_prompt={"image": 1},
)
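
The pixel limits above are written as multiples of 28 * 28. A small arithmetic check relates them to the literal values 784 and 1003520 quoted in the Description. The token estimate assumes, as the factorization suggests but this page does not confirm, that one visual token covers a 28 x 28 pixel area.

```python
PIXELS_PER_TOKEN = 28 * 28  # assumption: one visual token per 28x28 pixel area

min_pixels = 28 * 28
max_pixels = 1280 * 28 * 28


def approx_visual_tokens(pixels: int) -> int:
    """Rough visual-token count for an image with the given pixel area."""
    return pixels // PIXELS_PER_TOKEN


print(min_pixels, max_pixels)  # 784 1003520
print(approx_visual_tokens(min_pixels), approx_visual_tokens(max_pixels))  # 1 1280
```

Under that assumption, an image at the upper limit uses roughly 1280 of the 4096 positions allowed by `max_model_len` in the example, leaving the remainder for text and generated tokens.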

Large Model with Tensor Parallelism

from vllm import LLM

llm = LLM(
    model="nvidia/NVLM-D-72B",
    trust_remote_code=True,
    max_model_len=4096,
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},
)

Model with Architecture Override

from vllm import LLM

llm = LLM(
    model="deepseek-ai/deepseek-vl2-tiny",
    max_model_len=4096,
    max_num_seqs=2,
    hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]},
    limit_mm_per_prompt={"image": 1},
)

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Multimodal_Engine_Configuration
