Workflow:Vllm project Vllm Vision Language Inference

Knowledge Sources	vLLM vLLM Docs
Domains	LLMs, Inference, Multimodal, Vision
Last Updated	2026-02-08 13:00 GMT

Overview

End-to-end process for running inference on vision-language models (VLMs) that accept image, video, or audio inputs alongside text using vLLM.

Description

This workflow covers the procedure for performing multimodal inference with vision-language models. vLLM supports dozens of VLM architectures (LLaVA, Qwen-VL, InternVL, Phi-3-Vision, Gemma-3, etc.) and handles image/video preprocessing, token placeholder insertion, and attention mask construction automatically. The process covers model selection, multimodal input preparation, prompt formatting with model-specific templates, and output generation.

Usage

Execute this workflow when you need to generate text conditioned on visual inputs (images or videos) alongside text prompts. Typical scenarios include image captioning, visual question answering, document OCR, video understanding, and any task requiring the model to reason about visual content.

Execution Steps

Step 1: Select a Vision Language Model

Choose a supported VLM architecture from vLLM's model registry. Each model family has specific prompt formatting requirements, supported modalities (image, video, audio), and memory requirements.

Key considerations:

Check the model's supported modalities (image only vs. image+video)
Some models require trust_remote_code=True
Memory requirements vary significantly (some need multi-GPU with tensor parallelism)
limit_mm_per_prompt controls how many media items per request
mm_processor_kwargs allows model-specific preprocessing overrides

Step 2: Prepare Multimodal Inputs

Load and preprocess image or video data into the format expected by vLLM. Images can be PIL Image objects, file paths, or URLs. Videos are represented as numpy arrays of frames.

Key considerations:

Images should be in RGB mode (use convert_image_mode if needed)
Video inputs require specifying the number of frames to extract
The multimodal processor handles resizing and normalization automatically
mm_processor_cache can be enabled for repeated identical inputs

Step 3: Format Prompts with Media Placeholders

Construct prompts using the model-specific template that includes placeholder tokens for media content. Each VLM family expects a different format for indicating where image or video tokens should be inserted.

Key considerations:

Each model family has a unique placeholder syntax (e.g., <image>, <|image_pad|>, [IMG])
The tokenizer's apply_chat_template method handles formatting for chat-style models
Stop token IDs are model-specific and should be configured accordingly
Using the wrong prompt template will produce poor results

Step 4: Initialize Engine with Multimodal Config

Create the LLM instance with multimodal-aware settings. This includes configuring the maximum model length, per-prompt media limits, processor kwargs, and any model-specific overrides.

Key considerations:

max_model_len should account for both text and image token budget
limit_mm_per_prompt prevents excessive memory usage from many images
enforce_eager may be needed for some VLM architectures
hf_overrides can specify the correct architecture class if auto-detection fails

Step 5: Run Multimodal Generation

Submit the formatted prompts along with multimodal data to the generate method. The multi_modal_data parameter maps modality names to the actual media content.

Key considerations:

Pass media via multi_modal_data dict keyed by modality name
Batch inference works the same as text-only, with media attached per request
Sampling parameters (temperature, max_tokens) apply to the text output
Stop token IDs should match the model's expected end-of-generation tokens

Step 6: Extract and Process Results

Parse the generated text from the output objects. VLM outputs are text-only (describing or responding to the visual input) and follow the same output format as text generation.

Key considerations:

Output text quality depends on prompt formatting correctness
Some models produce structured output (JSON, coordinates) depending on the prompt
skip_special_tokens should be set appropriately for each model

Execution Diagram

GitHub URL

Workflow Repository