Principle:Vllm project Vllm Multimodal Engine Configuration
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Vision Language Models, GPU Memory Management |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Properly configuring the inference engine with multimodal-specific parameters determines whether a vision-language model can be loaded, how much GPU memory it consumes, and how many concurrent requests it can handle.
Description
Vision-language models impose additional configuration requirements beyond standard text-only LLMs. The vLLM engine must be configured to handle visual inputs, manage the additional memory overhead of vision encoders, and enforce per-request limits on multimodal content. Key configuration dimensions include:
- Multimodal input limits (`limit_mm_per_prompt`): Controls the maximum number of images, videos, or audio inputs allowed per prompt. This is critical for memory planning: each image or video frame consumes visual tokens that increase KV cache requirements. The typical setting is `{"image": 1}` for single-image tasks.
- Processor configuration (`mm_processor_kwargs`): Passes model-specific parameters to the HuggingFace multimodal processor, controlling image resolution, cropping strategy, number of crops, pixel limits, and video frame rate. For example, Phi-3.5-Vision uses `{"num_crops": 16}` and Qwen2.5-VL uses `{"min_pixels": 784, "max_pixels": 1003520}`.
- Model context length (`max_model_len`): The maximum sequence length including both text and visual tokens. VLMs typically require larger context windows because visual inputs are tokenized into hundreds or thousands of tokens.
- Concurrency (`max_num_seqs`): The maximum number of sequences processed simultaneously. VLMs often need lower concurrency (2-5) than text-only models due to the memory cost of visual features.
- Trust and execution settings: Many VLMs require `trust_remote_code=True` for custom modeling code and `enforce_eager=True` to disable CUDA graph compilation when the model architecture is not compatible with it.
- Tensor parallelism (`tensor_parallel_size`): Large VLMs (e.g., NVLM-D-72B, GLM-4.5V, Llama-4-Scout) require multi-GPU parallelism.
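The parameters above can be gathered in one place before the engine is created. The following is a minimal sketch, not a definitive setup: `build_vlm_engine_kwargs` and `create_engine` are hypothetical helper names, the defaults (`max_model_len=4096`, `max_num_seqs=2`) are illustrative values rather than recommendations, and the Hugging Face model ID is assumed to be the one matching the Phi-3.5-Vision example mentioned above.

```python
from typing import Any, Dict, Optional


def build_vlm_engine_kwargs(
    model: str,
    max_images_per_prompt: int = 1,
    max_model_len: int = 4096,   # illustrative default, not a recommendation
    max_num_seqs: int = 2,       # illustrative default, not a recommendation
    mm_processor_kwargs: Optional[Dict[str, Any]] = None,
    tensor_parallel_size: int = 1,
    trust_remote_code: bool = False,
    enforce_eager: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: collect VLM-related engine arguments in one dict."""
    if max_images_per_prompt < 1:
        raise ValueError("max_images_per_prompt must be at least 1")
    if min(max_model_len, max_num_seqs, tensor_parallel_size) < 1:
        raise ValueError("lengths, sequence counts and TP size must be positive")

    kwargs: Dict[str, Any] = {
        "model": model,
        "limit_mm_per_prompt": {"image": max_images_per_prompt},
        "max_model_len": max_model_len,
        "max_num_seqs": max_num_seqs,
        "tensor_parallel_size": tensor_parallel_size,
        "trust_remote_code": trust_remote_code,
        "enforce_eager": enforce_eager,
    }
    # Only forward processor options when the caller supplied some.
    if mm_processor_kwargs:
        kwargs["mm_processor_kwargs"] = dict(mm_processor_kwargs)
    return kwargs


def create_engine(engine_kwargs: Dict[str, Any]):
    """Instantiate the engine; needs vLLM installed and a suitable GPU."""
    from vllm import LLM  # imported lazily so the helper above works without vLLM

    return LLM(**engine_kwargs)


# Single-image configuration mirroring the Phi-3.5-Vision example above.
phi35_kwargs = build_vlm_engine_kwargs(
    "microsoft/Phi-3.5-vision-instruct",  # assumed model ID
    mm_processor_kwargs={"num_crops": 16},
    trust_remote_code=True,
)
```

Calling `create_engine(phi35_kwargs)` would then load the model. Keeping the arguments in a plain dict makes it easy to log or compare configurations when switching between models.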
Usage
Use multimodal engine configuration when:
- Initializing a vLLM `LLM` instance for VLM inference.
- Tuning memory usage and throughput for production VLM serving.
- Resolving out-of-memory errors when loading VLMs.
- Switching between models with different resource requirements.
Theoretical Basis
Engine configuration for VLMs involves balancing three competing concerns:
- GPU memory: The vision encoder, language model weights, KV cache, and visual feature buffers must all fit in GPU memory. The `max_model_len` and `max_num_seqs` parameters directly control KV cache allocation, while `limit_mm_per_prompt` bounds the per-request visual token overhead.
- Throughput: Higher `max_num_seqs` enables more concurrent request processing through continuous batching, but each VLM request consumes more memory than text-only requests due to visual tokens.
- Quality: Higher image resolution (controlled via `mm_processor_kwargs`) produces more visual tokens and better visual understanding, but increases memory consumption and reduces throughput.
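To make the memory trade-off concrete, the worst-case KV cache demand can be estimated from the model shape, `max_model_len`, and `max_num_seqs`. This is a back-of-the-envelope sketch under stated assumptions: the layer count, KV head count, and head dimension are hypothetical values, not those of a specific model, and the result is an upper bound on demand if every sequence reached full length. vLLM allocates its cache in pages within the memory budget it is given, so actual usage depends on the live workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheShape:
    """Hypothetical description of a decoder's KV cache layout."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # 2 bytes per element for fp16/bf16


def kv_bytes_per_token(shape: KVCacheShape) -> int:
    # Factor 2: each layer stores one key vector and one value vector per token.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.dtype_bytes)


def worst_case_kv_bytes(shape: KVCacheShape,
                        max_model_len: int,
                        max_num_seqs: int) -> int:
    """Upper bound: every concurrent sequence fills the whole context."""
    return kv_bytes_per_token(shape) * max_model_len * max_num_seqs


# Assumed shape for illustration only.
shape = KVCacheShape(num_layers=32, num_kv_heads=8, head_dim=128)

for seqs in (2, 5):
    gib = worst_case_kv_bytes(shape, max_model_len=4096, max_num_seqs=seqs) / 2**30
    print(f"max_num_seqs={seqs}: up to {gib:.2f} GiB of KV cache")
```

Under these assumptions each token costs 128 KiB of cache, so raising `max_num_seqs` from 2 to 5 at a 4096-token context raises the worst-case demand from 1 GiB to 2.5 GiB.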
The `limit_mm_per_prompt` parameter is especially important for VLMs because visual inputs have non-uniform token counts. A single high-resolution image may produce 2,000+ visual tokens, while a low-resolution image produces only a few hundred. Capping the number of multimodal inputs per prompt ensures predictable memory usage across requests.
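The effect of resolution and input caps on prompt length can be sketched numerically. The example below assumes that one visual token covers roughly a 28x28 pixel area, which is consistent with the Qwen2.5-VL pixel limits quoted earlier (784 = 28 x 28) but ignores the exact resizing and rounding a real processor performs. The function names are hypothetical, and the counts are approximations rather than what a given processor will emit.

```python
import math
from typing import Dict


def approx_visual_tokens(width: int, height: int,
                         min_pixels: int = 784,
                         max_pixels: int = 1003520,
                         pixels_per_token: int = 28 * 28) -> int:
    """Approximate token count after clamping the image area to the pixel limits."""
    area = min(max(width * height, min_pixels), max_pixels)
    return math.ceil(area / pixels_per_token)


def worst_case_prompt_tokens(text_tokens: int,
                             limit_mm_per_prompt: Dict[str, int],
                             max_tokens_per_item: Dict[str, int]) -> int:
    """Text tokens plus the largest visual footprint the limits allow."""
    visual = sum(count * max_tokens_per_item[modality]
                 for modality, count in limit_mm_per_prompt.items())
    return text_tokens + visual


small = approx_visual_tokens(448, 448)     # low-resolution image
large = approx_visual_tokens(4000, 3000)   # clamped to max_pixels

# With at most two images per prompt, the context must cover this many tokens.
needed = worst_case_prompt_tokens(512, {"image": 2}, {"image": large})
print(small, large, needed)
```

Comparing `needed` against `max_model_len` before starting the engine shows whether the chosen limits leave room for the text portion of the prompt and the generated output.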