Implementation:Vllm project Vllm LLM Init Multimodal

From Leeroopedia


Knowledge Sources
Domains: LLM Serving, Vision Language Models, GPU Memory Management
Last Updated: 2026-02-08 13:00 GMT

Overview

Concrete tool for initializing the vLLM inference engine with multimodal-specific configuration parameters, provided by vLLM's LLM class.

Description

The LLM class is the primary entrypoint for offline (batch) inference in vLLM. For VLM inference, several multimodal-specific parameters are typically set alongside the standard model and serving parameters. The constructor builds an EngineArgs dataclass from its arguments, which is then used to initialize the LLMEngine with the model configuration, memory allocation, and multimodal processing pipeline.

Key multimodal parameters:

  • limit_mm_per_prompt: A dictionary mapping modality names to maximum counts per prompt (e.g., {"image": 1}, {"video": 1}). This is passed through to MultiModalConfig.limit_per_prompt.
  • mm_processor_kwargs: A dictionary of keyword arguments forwarded to the model's HuggingFace multimodal processor. Model-specific examples include {"num_crops": 16} for Phi-3.5-Vision, {"min_pixels": 784, "max_pixels": 1003520} for Qwen2.5-VL, and {"do_pan_and_scan": True} for Gemma-3.
  • trust_remote_code: Required by many VLMs (InternVL, Phi-3-Vision, Molmo, etc.) that use custom model code hosted on HuggingFace.
  • enforce_eager: Disables CUDA graph capture and runs the model in eager mode; required by some VLM architectures (GLM-4v, Gemma3n, Idefics3, SmolVLM).
  • hf_overrides: Overrides HuggingFace model config fields, used when architecture detection needs correction (e.g., {"architectures": ["DeepseekVLV2ForCausalLM"]} for DeepSeek-VL2).
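
The parameters above can be collected into per-model presets. The following is a minimal sketch, not part of vLLM: `MODEL_PRESETS` and `build_llm_kwargs` are hypothetical names, and the values are copied from the examples on this page, so treat them as starting points rather than tuned settings.

```python
import copy
from typing import Any

# Per-model settings taken from the examples on this page (starting points only).
MODEL_PRESETS: dict[str, dict[str, Any]] = {
    "microsoft/Phi-3.5-vision-instruct": {
        "trust_remote_code": True,
        "max_num_seqs": 2,
        "mm_processor_kwargs": {"num_crops": 16},
    },
    "Qwen/Qwen2.5-VL-3B-Instruct": {
        "max_num_seqs": 5,
        "mm_processor_kwargs": {"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28},
    },
    "deepseek-ai/deepseek-vl2-tiny": {
        "max_num_seqs": 2,
        "hf_overrides": {"architectures": ["DeepseekVLV2ForCausalLM"]},
    },
}


def build_llm_kwargs(model: str, **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: merge shared defaults, a model preset, and overrides."""
    kwargs: dict[str, Any] = {
        "model": model,
        "max_model_len": 4096,
        "limit_mm_per_prompt": {"image": 1},
    }
    # Deep-copy so callers cannot mutate the shared preset table by accident.
    kwargs.update(copy.deepcopy(MODEL_PRESETS.get(model, {})))
    kwargs.update(overrides)  # explicit arguments win over presets
    return kwargs


# Usage (requires vLLM and a suitable GPU):
#   from vllm import LLM
#   llm = LLM(**build_llm_kwargs("microsoft/Phi-3.5-vision-instruct"))
```

Keeping the presets in plain dictionaries means they can be inspected or logged before any GPU memory is allocated.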

Usage

Use LLM initialization with multimodal configuration when:

  • Setting up offline VLM inference with vLLM.
  • Configuring memory limits for multimodal serving.
  • Loading models that require custom trust or execution settings.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py (lines 199-364), vllm/engine/arg_utils.py (lines 457-480 for multimodal EngineArgs)

Signature

class LLM:
    def __init__(
        self,
        model: str,
        *,
        tokenizer: str | None = None,
        trust_remote_code: bool = False,
        tensor_parallel_size: int = 1,
        dtype: ModelDType = "auto",
        quantization: QuantizationMethods | None = None,
        seed: int = 0,
        gpu_memory_utilization: float = 0.9,
        enforce_eager: bool = False,
        max_model_len: int | None = None,       # via **kwargs -> EngineArgs
        max_num_seqs: int = 256,                 # via **kwargs -> EngineArgs
        limit_mm_per_prompt: dict | None = None, # via **kwargs -> EngineArgs
        mm_processor_kwargs: dict | None = None,
        hf_overrides: HfOverrides | None = None,
        **kwargs: Any,
    ) -> None: ...
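
The "via **kwargs -> EngineArgs" comments describe a forwarding pattern: arguments that the constructor does not name explicitly are passed on to the engine-arguments dataclass. The toy below illustrates only that pattern. `ToyEngineArgs` and `ToyLLM` are stand-ins invented for this sketch and are not vLLM's classes; the real EngineArgs has many more fields, and its handling of unknown arguments may differ.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyEngineArgs:
    """Stand-in for an engine-arguments dataclass (illustrative fields only)."""
    model: str
    trust_remote_code: bool = False
    enforce_eager: bool = False
    max_model_len: Optional[int] = None
    max_num_seqs: int = 256
    limit_mm_per_prompt: Optional[dict[str, int]] = None


class ToyLLM:
    """Stand-in constructor: named arguments and **kwargs end up in one dataclass."""

    def __init__(self, model: str, *, trust_remote_code: bool = False,
                 enforce_eager: bool = False, **kwargs: Any) -> None:
        known = {f.name for f in fields(ToyEngineArgs)}
        unknown = set(kwargs) - known
        if unknown:
            # This sketch rejects names the dataclass does not define.
            raise TypeError(f"unexpected engine arguments: {sorted(unknown)}")
        self.engine_args = ToyEngineArgs(
            model=model,
            trust_remote_code=trust_remote_code,
            enforce_eager=enforce_eager,
            **kwargs,
        )


toy = ToyLLM("llava-hf/llava-1.5-7b-hf", max_model_len=4096,
             limit_mm_per_prompt={"image": 1})
```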

Import

from vllm import LLM

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model ID or local path (e.g., "llava-hf/llava-1.5-7b-hf") |
| limit_mm_per_prompt | dict[str, int] | No | Maximum multimodal inputs per prompt per modality (e.g., {"image": 1}) |
| mm_processor_kwargs | dict[str, Any] | No | Model-specific processor kwargs (e.g., {"num_crops": 16}) |
| trust_remote_code | bool | No | Whether to trust remote model code (default: False) |
| enforce_eager | bool | No | Disable CUDA graph capture and run in eager mode (default: False) |
| max_model_len | int or None | No | Maximum sequence length including visual tokens (default: model config value) |
| max_num_seqs | int | No | Maximum concurrent sequences (default: 256; VLMs often use 2-5) |
| tensor_parallel_size | int | No | Number of GPUs for tensor parallelism (default: 1) |
| hf_overrides | HfOverrides or None | No | Overrides for HuggingFace model config fields |
| dtype | str | No | Data type for model weights (default: "auto"; some VLMs need "bfloat16" or "half") |

Outputs

| Name | Type | Description |
|---|---|---|
| llm | LLM | Initialized LLM instance ready for multimodal generation via .generate() |
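
To show what "ready for multimodal generation" looks like in practice, here is a minimal sketch of one image request. The helper names `build_image_request` and `describe_image` are hypothetical. The prompt template is the LLaVA-1.5 chat format and is assumed here; other models expect different templates and image placeholders.

```python
from typing import Any


def build_image_request(question: str, image: Any) -> dict[str, Any]:
    """Build one offline-inference request.

    `image` is expected to be a PIL.Image.Image. The template below is the
    LLaVA-1.5 format (an assumption of this sketch); adapt it per model.
    """
    return {
        "prompt": f"USER: <image>\n{question}\nASSISTANT:",
        "multi_modal_data": {"image": image},
    }


def describe_image(llm: Any, image: Any,
                   question: str = "What is in this image?") -> str:
    """Run one request through an initialized vllm.LLM and return the text."""
    # Imported lazily so this module can be loaded without vLLM installed.
    from vllm import SamplingParams

    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(build_image_request(question, image),
                           sampling_params=params)
    return outputs[0].outputs[0].text
```

With `limit_mm_per_prompt={"image": 1}`, each request should carry a single image under the "image" key, as above.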

Usage Examples

Basic LLaVA-1.5 Configuration

from vllm import LLM

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)
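
Because `limit_mm_per_prompt` caps the number of items per modality, it can be useful to check requests before submitting them. The sketch below is a client-side pre-check with hypothetical helper names; it is not how vLLM enforces the limit internally, and it only checks modalities that appear in `limits`.

```python
from typing import Any


def count_mm_items(multi_modal_data: dict[str, Any]) -> dict[str, int]:
    """Count items per modality: a list or tuple counts per element, else one."""
    return {
        modality: len(value) if isinstance(value, (list, tuple)) else 1
        for modality, value in multi_modal_data.items()
    }


def check_mm_limits(multi_modal_data: dict[str, Any],
                    limits: dict[str, int]) -> None:
    """Raise ValueError if a request exceeds the configured per-prompt limits.

    Modalities missing from `limits` are skipped, since their effective
    limit depends on vLLM defaults that this sketch does not model.
    """
    for modality, count in count_mm_items(multi_modal_data).items():
        if modality in limits and count > limits[modality]:
            raise ValueError(
                f"{count} {modality} item(s) in prompt, "
                f"but limit_mm_per_prompt allows {limits[modality]}"
            )
```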

Phi-3.5-Vision with Processor Kwargs

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
    mm_processor_kwargs={"num_crops": 16},
    limit_mm_per_prompt={"image": 1},
)

Qwen2.5-VL with Pixel Limits

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    max_model_len=4096,
    max_num_seqs=5,
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
        "fps": 1,
    },
    limit_mm_per_prompt={"image": 1},
)
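
The pixel limits above are written as multiples of 28 * 28. The arithmetic can be checked with a short sketch, assuming (as in the Qwen2-VL family's documented convention) that each 28x28 pixel block corresponds to roughly one visual token. The function names are hypothetical.

```python
PATCH_UNIT = 28  # assumed: one visual token per 28x28 pixel block


def pixel_budget(num_blocks: int, unit: int = PATCH_UNIT) -> int:
    """Pixel count covering `num_blocks` blocks of unit x unit pixels."""
    return num_blocks * unit * unit


def approx_visual_tokens(pixels: int, unit: int = PATCH_UNIT) -> int:
    """Rough number of visual tokens for an image with `pixels` pixels."""
    return pixels // (unit * unit)


min_pixels = pixel_budget(1)     # 784
max_pixels = pixel_budget(1280)  # 1003520

# Under this assumption, one image at max_pixels uses about 1280 tokens,
# which leaves room for text within max_model_len=4096.
tokens_at_max = approx_visual_tokens(max_pixels)
```

These values match the literals 784 and 1003520 quoted for Qwen2.5-VL in the parameter list.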

Large Model with Tensor Parallelism

from vllm import LLM

llm = LLM(
    model="nvidia/NVLM-D-72B",
    trust_remote_code=True,
    max_model_len=4096,
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},
)

Model with Architecture Override

from vllm import LLM

llm = LLM(
    model="deepseek-ai/deepseek-vl2-tiny",
    max_model_len=4096,
    max_num_seqs=2,
    hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]},
    limit_mm_per_prompt={"image": 1},
)

Related Pages

Implements Principle
