Implementation:Vllm project Vllm LLM Init Multimodal
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Vision Language Models, GPU Memory Management |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
The LLM class constructor is the concrete entry point for initializing the vLLM inference engine with multimodal-specific configuration parameters.
Description
The LLM class is the primary entrypoint for offline (batch) inference in vLLM. When used for VLM inference, several multimodal-specific parameters are typically set alongside the standard model and serving parameters. The constructor builds an EngineArgs dataclass from its arguments, which is then used to initialize the LLMEngine with the model configuration, memory allocation, and multimodal processing pipelines.
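The forwarding of constructor arguments into a single configuration object can be sketched as follows. This is an illustration only: SketchEngineArgs and SketchLLM are hypothetical stand-ins written for this page, not vLLM classes, and they model only a few of the fields discussed here.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class SketchEngineArgs:
    """Hypothetical stand-in for EngineArgs; NOT the real vLLM class."""
    model: str
    trust_remote_code: bool = False
    enforce_eager: bool = False
    max_model_len: Optional[int] = None
    max_num_seqs: int = 256
    limit_mm_per_prompt: Optional[Dict[str, int]] = None
    mm_processor_kwargs: Optional[Dict[str, Any]] = None


class SketchLLM:
    """Hypothetical stand-in for LLM: gathers arguments into one dataclass."""

    def __init__(self, model: str, *, trust_remote_code: bool = False,
                 enforce_eager: bool = False, **kwargs: Any) -> None:
        # Reject names the config object does not define.
        known = {f.name for f in fields(SketchEngineArgs)}
        unknown = set(kwargs) - known
        if unknown:
            raise TypeError(f"unexpected arguments: {sorted(unknown)}")
        # Named parameters and **kwargs land in the same config object.
        self.engine_args = SketchEngineArgs(
            model=model,
            trust_remote_code=trust_remote_code,
            enforce_eager=enforce_eager,
            **kwargs,
        )
```

In this sketch, SketchLLM("llava-hf/llava-1.5-7b-hf", max_model_len=4096) stores max_model_len on engine_args even though it is not a named constructor parameter, which mirrors the "via **kwargs -> EngineArgs" notes in the signature.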
Key multimodal parameters:
- limit_mm_per_prompt: A dictionary mapping modality names to maximum counts per prompt (e.g., {"image": 1}, {"video": 1}). This is passed through to MultiModalConfig.limit_per_prompt.
- mm_processor_kwargs: A dictionary of keyword arguments forwarded to the model's HuggingFace multimodal processor. Model-specific examples include {"num_crops": 16} for Phi-3.5-Vision, {"min_pixels": 784, "max_pixels": 1003520} for Qwen2.5-VL, and {"do_pan_and_scan": True} for Gemma-3.
- trust_remote_code: Required by many VLMs (InternVL, Phi-3-Vision, Molmo, etc.) that use custom model code hosted on HuggingFace.
- enforce_eager: Disables CUDA graph compilation, required by some VLM architectures (GLM-4v, Gemma3n, Idefics3, SmolVLM).
- hf_overrides: Overrides HuggingFace model config fields, used when architecture detection needs correction (e.g., {"architectures": ["DeepseekVLV2ForCausalLM"]} for DeepSeek-VL2).
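The Qwen2.5-VL pixel bounds quoted above are easier to read as multiples of a 28 x 28 pixel unit, which is how the Qwen2.5-VL usage example on this page writes them: 784 = 28 * 28 and 1003520 = 1280 * 28 * 28. A minimal sketch of that arithmetic follows; the helper name qwen_pixel_kwargs and the constant PIXEL_UNIT are assumptions made for illustration and are not part of vLLM.

```python
from typing import Any, Dict

# 28 * 28 = 784 pixels; the unit used in the Qwen2.5-VL example on this page.
PIXEL_UNIT = 28 * 28


def qwen_pixel_kwargs(min_units: int = 1, max_units: int = 1280) -> Dict[str, Any]:
    """Hypothetical helper: build min/max pixel bounds from unit counts."""
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    return {
        "min_pixels": min_units * PIXEL_UNIT,
        "max_pixels": max_units * PIXEL_UNIT,
    }
```

With the defaults, the returned dictionary has the same values as the Qwen2.5-VL entry in the list above and could be passed as mm_processor_kwargs.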
Usage
Use LLM initialization with multimodal configuration when:
- Setting up offline VLM inference with vLLM.
- Configuring memory limits for multimodal serving.
- Loading models that require custom trust or execution settings.
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py (lines 199-364), vllm/engine/arg_utils.py (lines 457-480 for multimodal EngineArgs)
Signature
class LLM:
    def __init__(
        self,
        model: str,
        *,
        tokenizer: str | None = None,
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        enforce_eager: bool = False,
        max_model_len: int | None = None,  # via **kwargs -> EngineArgs
        max_num_seqs: int = 256,  # via **kwargs -> EngineArgs
        limit_mm_per_prompt: dict | None = None,  # via **kwargs -> EngineArgs
        mm_processor_kwargs: dict | None = None,
        hf_overrides: HfOverrides | None = None,
        **kwargs: Any,
    ) -> None: ...
Import
from vllm import LLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model ID or local path (e.g., "llava-hf/llava-1.5-7b-hf") |
| limit_mm_per_prompt | dict[str, int] | No | Maximum multimodal inputs per prompt per modality (e.g., {"image": 1}) |
| mm_processor_kwargs | dict[str, Any] | No | Model-specific processor kwargs (e.g., {"num_crops": 16}) |
| trust_remote_code | bool | No | Whether to trust remote model code (default: False) |
| enforce_eager | bool | No | Disable CUDA graph compilation (default: False) |
| max_model_len | int or None | No | Maximum sequence length including visual tokens (default: model config value) |
| max_num_seqs | int | No | Maximum concurrent sequences (default: 256; VLMs often use 2-5) |
| tensor_parallel_size | int | No | Number of GPUs for tensor parallelism (default: 1) |
| hf_overrides | HfOverrides or None | No | Overrides for HuggingFace model config fields |
| dtype | str | No | Data type for model weights (default: "auto"; some VLMs need "bfloat16" or "half") |
Outputs
| Name | Type | Description |
|---|---|---|
| llm | LLM | Initialized LLM instance ready for multimodal generation via .generate() |
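To show what "ready for multimodal generation" looks like in practice, here is a minimal sketch that passes one image to .generate() as a dictionary with "prompt" and "multi_modal_data" keys. The function names build_llava_request and describe_image are hypothetical, the prompt template is the one commonly used with LLaVA-1.5 and is an assumption here, and the sampling settings are arbitrary. The vLLM and Pillow imports are deferred so that the request builder can be exercised without a GPU.

```python
from typing import Any, Dict


def build_llava_request(question: str, image: Any) -> Dict[str, Any]:
    """Hypothetical helper: build a single-image request dictionary."""
    # Assumed LLaVA-1.5 chat template; other models expect other placeholders.
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    return {"prompt": prompt, "multi_modal_data": {"image": image}}


def describe_image(image_path: str, question: str) -> str:
    """Hypothetical end-to-end call; requires vLLM, Pillow, and a GPU."""
    from PIL import Image
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        max_model_len=4096,
        limit_mm_per_prompt={"image": 1},
    )
    image = Image.open(image_path).convert("RGB")
    outputs = llm.generate(
        build_llava_request(question, image),
        sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
    )
    # One request in, one RequestOutput out; take its first completion.
    return outputs[0].outputs[0].text
```

The request carries a single image, which stays within the limit_mm_per_prompt={"image": 1} setting used at initialization.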
Usage Examples
Basic LLaVA-1.5 Configuration
from vllm import LLM

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)
Phi-3.5-Vision with Processor Kwargs
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
    mm_processor_kwargs={"num_crops": 16},
    limit_mm_per_prompt={"image": 1},
)
Qwen2.5-VL with Pixel Limits
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    max_model_len=4096,
    max_num_seqs=5,
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    limit_mm_per_prompt={"image": 1},
)
Large Model with Tensor Parallelism
from vllm import LLM

llm = LLM(
    model="nvidia/NVLM-D-72B",
    trust_remote_code=True,
    max_model_len=4096,
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},
)
Model with Architecture Override
from vllm import LLM

llm = LLM(
    model="deepseek-ai/deepseek-vl2-tiny",
    max_model_len=4096,
    max_num_seqs=2,
    hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]},
    limit_mm_per_prompt={"image": 1},
)
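Per-Model Presets (Sketch)

The examples above share max_model_len=4096 and limit_mm_per_prompt={"image": 1} and differ only in a few model-specific keyword arguments. One way to keep those differences in a single place is a preset table, sketched below. MODEL_PRESETS, engine_kwargs, and load_vlm are hypothetical names introduced for illustration; the values are copied from three of the examples on this page.

```python
import copy
from typing import Any, Dict

# Settings shared by every example on this page.
COMMON: Dict[str, Any] = {
    "max_model_len": 4096,
    "limit_mm_per_prompt": {"image": 1},
}

# Model-specific additions, copied from the usage examples.
MODEL_PRESETS: Dict[str, Dict[str, Any]] = {
    "llava-hf/llava-1.5-7b-hf": {},
    "microsoft/Phi-3.5-vision-instruct": {
        "trust_remote_code": True,
        "max_num_seqs": 2,
        "mm_processor_kwargs": {"num_crops": 16},
    },
    "deepseek-ai/deepseek-vl2-tiny": {
        "max_num_seqs": 2,
        "hf_overrides": {"architectures": ["DeepseekVLV2ForCausalLM"]},
    },
}


def engine_kwargs(model: str, **overrides: Any) -> Dict[str, Any]:
    """Merge common settings, the model preset, and caller overrides."""
    if model not in MODEL_PRESETS:
        raise KeyError(f"no preset for {model!r}")
    merged = {"model": model, **COMMON, **MODEL_PRESETS[model], **overrides}
    # Deep copy so callers cannot mutate the shared preset dictionaries.
    return copy.deepcopy(merged)


def load_vlm(model: str, **overrides: Any):
    """Hypothetical loader; requires vLLM and suitable hardware."""
    from vllm import LLM  # deferred so engine_kwargs works without vLLM

    return LLM(**engine_kwargs(model, **overrides))
```

Later entries in the merge take precedence, so a call such as engine_kwargs("deepseek-ai/deepseek-vl2-tiny", max_num_seqs=1) replaces the preset value while keeping the architecture override.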