Workflow:Vllm project Vllm Vision Language Inference

From Leeroopedia


Knowledge Sources
Domains LLMs, Inference, Multimodal, Vision
Last Updated 2026-02-08 13:00 GMT

Overview

End-to-end process for using vLLM to run inference on vision-language models (VLMs) that accept image, video, or audio inputs alongside text.

Description

This workflow covers the procedure for performing multimodal inference with vision-language models. vLLM supports dozens of VLM architectures (LLaVA, Qwen-VL, InternVL, Phi-3-Vision, Gemma-3, etc.) and handles image/video preprocessing, token placeholder insertion, and attention mask construction automatically. The process covers model selection, multimodal input preparation, prompt formatting with model-specific templates, and output generation.

Usage

Execute this workflow when you need to generate text conditioned on visual inputs (images or videos) alongside text prompts. Typical scenarios include image captioning, visual question answering, document OCR, video understanding, and any task requiring the model to reason about visual content.

Execution Steps

Step 1: Select a Vision Language Model

Choose a supported VLM architecture from vLLM's model registry. Each model family has specific prompt formatting requirements, supported modalities (image, video, audio), and memory requirements.

Key considerations:

  • Check the model's supported modalities (image only, image and video, or audio as well)
  • Some models require trust_remote_code=True
  • Memory requirements vary significantly (some need multi-GPU with tensor parallelism)
  • limit_mm_per_prompt caps how many media items of each modality a single request may include
  • mm_processor_kwargs allows model-specific preprocessing overrides
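
One way to make these considerations concrete is to keep a small hand-written table of candidate models and filter it by the modalities a task needs. The sketch below is illustrative only: VlmChoice, CANDIDATES, candidates_for, and engine_kwargs are hypothetical names (not part of vLLM), and the three entries are examples that should be checked against the supported-models list for the vLLM release in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlmChoice:
    """Hypothetical record describing one candidate model (not a vLLM type)."""
    model: str
    modalities: frozenset
    trust_remote_code: bool = False


# Illustrative entries only; verify each against vLLM's supported-models list.
CANDIDATES = [
    VlmChoice("llava-hf/llava-1.5-7b-hf", frozenset({"image"})),
    VlmChoice("Qwen/Qwen2-VL-7B-Instruct", frozenset({"image", "video"})),
    VlmChoice("microsoft/Phi-3-vision-128k-instruct", frozenset({"image"}),
              trust_remote_code=True),
]


def candidates_for(required):
    """Return the candidates whose modalities cover every required modality."""
    required = frozenset(required)
    return [c for c in CANDIDATES if required <= c.modalities]


def engine_kwargs(choice, limits):
    """Build keyword arguments for the engine from a choice and per-modality limits."""
    unsupported = set(limits) - choice.modalities
    if unsupported:
        raise ValueError(f"{choice.model} does not accept: {sorted(unsupported)}")
    return {
        "model": choice.model,
        "trust_remote_code": choice.trust_remote_code,
        "limit_mm_per_prompt": dict(limits),
    }
```

The resulting dictionary is meant to be unpacked into the engine constructor later; memory sizing and tensor parallelism are left out because they depend on the hardware.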

Step 2: Prepare Multimodal Inputs

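Load and preprocess image or video data into the format expected by vLLM. For offline generation, images are typically passed as PIL Image objects; URLs (and local files, where permitted) can be referenced through chat-style message inputs. Videos are represented as numpy arrays of frames.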

Key considerations:

  • Images should be in RGB mode (use convert_image_mode if needed)
  • Video inputs require specifying the number of frames to extract
  • The multimodal processor handles resizing and normalization automatically
  • The multimodal processor cache (sized with mm_processor_cache_gb in recent vLLM releases) avoids reprocessing repeated identical inputs
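
The frame-count and RGB points above can be sketched with two small helpers. Both names (sample_frame_indices, load_rgb_image) are hypothetical, uniform spacing is just one reasonable sampling strategy, and plain Pillow conversion is used here as a stand-in for the convert_image_mode helper mentioned above.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick num_frames indices spread evenly from the first to the last frame."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Fewer frames available than requested: keep them all.
        return list(range(total_frames))
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def load_rgb_image(path):
    """Open an image file and force RGB mode (requires Pillow)."""
    from PIL import Image  # lazy import keeps the sampler dependency-free
    return Image.open(path).convert("RGB")
```

With the frames of a decoded video held in a numpy array, the sampled indices can be used to select a subset, for example frames[sample_frame_indices(len(frames), 16)].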

Step 3: Format Prompts with Media Placeholders

Construct prompts using the model-specific template that includes placeholder tokens for media content. Each VLM family expects a different format for indicating where image or video tokens should be inserted.

Key considerations:

  • Each model family has a unique placeholder syntax (e.g., <image>, <|image_pad|>, [IMG])
  • The tokenizer's apply_chat_template method handles formatting for chat-style models
  • Stop token IDs are model-specific and should be configured accordingly
  • Using the wrong prompt template can cause a mismatch between placeholders and supplied media items, or yield degraded or off-topic output
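
As a minimal sketch of placeholder handling, the helpers below build a single-turn prompt in the LLaVA-1.5 style and check that the number of placeholders matches the number of images. The function names are hypothetical, the template applies to that one model family only, and joining several placeholders with newlines is an assumption; other families need their own template, usually obtained from the tokenizer's chat template.

```python
IMAGE_PLACEHOLDER = "<image>"

# LLaVA-1.5 style single-turn template; other model families differ.
LLAVA_15_TEMPLATE = "USER: {placeholders}\n{question}\nASSISTANT:"


def build_prompt(question, num_images=1):
    """Insert one placeholder per image ahead of the question text."""
    # Assumption: multiple placeholders are separated by newlines.
    placeholders = "\n".join([IMAGE_PLACEHOLDER] * num_images)
    return LLAVA_15_TEMPLATE.format(placeholders=placeholders, question=question)


def check_placeholders(prompt, num_images, placeholder=IMAGE_PLACEHOLDER):
    """Raise if the prompt's placeholder count differs from the image count."""
    found = prompt.count(placeholder)
    if found != num_images:
        raise ValueError(f"expected {num_images} placeholder(s), found {found}")
    return prompt
```

Running the check before submission surfaces template mistakes early instead of at generation time.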

Step 4: Initialize Engine with Multimodal Config

Create the LLM instance with multimodal-aware settings. This includes configuring the maximum model length, per-prompt media limits, processor kwargs, and any model-specific overrides.

Key considerations:

  • max_model_len must cover the text tokens, the tokens each media item expands into, and the tokens to be generated
  • limit_mm_per_prompt prevents excessive memory usage from many images
  • enforce_eager may be needed for some VLM architectures
  • hf_overrides can specify the correct architecture class if auto-detection fails
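
The token budget behind max_model_len is simple arithmetic, sketched below together with a thin wrapper around the engine constructor. required_model_len and create_llm are hypothetical helper names, and the figure of 576 tokens per image in the usage note is an assumption that holds for LLaVA-1.5 at 336-pixel resolution; other models expand images into different, sometimes variable, token counts.

```python
def required_model_len(text_tokens, images, tokens_per_image, max_new_tokens):
    """Lower bound on max_model_len for one request."""
    return text_tokens + images * tokens_per_image + max_new_tokens


def create_llm(model, max_model_len, images_per_prompt, enforce_eager=False):
    """Construct the engine with multimodal-aware settings (requires vLLM)."""
    from vllm import LLM  # lazy import: the budget helper runs without vLLM

    return LLM(
        model=model,
        max_model_len=max_model_len,
        limit_mm_per_prompt={"image": images_per_prompt},
        enforce_eager=enforce_eager,
    )
```

For example, a 128-token text prompt with two images at an assumed 576 tokens each and 256 generated tokens needs required_model_len(128, 2, 576, 256) = 1536 tokens, so max_model_len should be at least that large.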

Step 5: Run Multimodal Generation

Submit the formatted prompts along with multimodal data to the generate method. The multi_modal_data parameter maps modality names to the actual media content.

Key considerations:

  • Pass media via multi_modal_data dict keyed by modality name
  • Batch inference works the same as text-only, with media attached per request
  • Sampling parameters (temperature, max_tokens) apply to the text output
  • Stop token IDs should match the model's expected end-of-generation tokens
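
A minimal sketch of the request shape follows: each request is a dictionary with a prompt and a multi_modal_data mapping, and a batch is simply a list of such dictionaries. build_requests and run_generation are hypothetical helper names, and the sampling values are placeholders rather than recommendations.

```python
def build_requests(prompts, images):
    """Pair each prompt with its image in the dict format accepted by generate."""
    if len(prompts) != len(images):
        raise ValueError("each prompt needs exactly one image in this sketch")
    return [
        {"prompt": prompt, "multi_modal_data": {"image": image}}
        for prompt, image in zip(prompts, images)
    ]


def run_generation(llm, requests, stop_token_ids=None):
    """Submit the batch to an existing LLM instance (requires vLLM)."""
    from vllm import SamplingParams  # lazy import: build_requests needs no vLLM

    params = SamplingParams(
        temperature=0.2,   # placeholder value
        max_tokens=64,     # placeholder value
        stop_token_ids=stop_token_ids,
    )
    return llm.generate(requests, sampling_params=params)
```

For video models the inner key would be "video" and the value an array of frames, assuming the chosen model supports that modality.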

Step 6: Extract and Process Results

Parse the generated text from the output objects. VLM outputs are text-only (describing or responding to the visual input) and follow the same output format as text generation.

Key considerations:

  • Output text quality depends on prompt formatting correctness
  • Some models produce structured output (JSON, coordinates) depending on the prompt
  • skip_special_tokens should be set appropriately for each model
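
Result handling can be sketched as below. Each returned object exposes its completions under outputs, with the generated string in the text attribute. extract_texts and try_parse_json are hypothetical helpers, and the stand-in objects only mimic the attributes used here so the snippet runs without a model.

```python
import json
from types import SimpleNamespace


def extract_texts(request_outputs):
    """Return the first completion's text for every request, whitespace-trimmed."""
    return [out.outputs[0].text.strip() for out in request_outputs]


def try_parse_json(text):
    """Parse structured output if the model produced valid JSON, else None."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Stand-in for the objects returned by generate, for illustration only.
fake_outputs = [SimpleNamespace(outputs=[SimpleNamespace(text=' {"label": "cat"} ')])]
texts = extract_texts(fake_outputs)
parsed = try_parse_json(texts[0])
```

Falling back to None rather than raising keeps free-form captions and structured answers on the same code path.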
