Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Vllm project Vllm Multimodal Prompt Formatting

From Leeroopedia


Knowledge Sources
Domains Prompt Engineering, Vision Language Models, Tokenization
Last Updated 2026-02-08 13:00 GMT

Overview

Constructing correctly formatted prompts with model-specific vision token placeholders is essential for vision-language models to properly associate visual inputs with textual instructions.

Description

Each vision-language model family defines its own prompt format that specifies where and how visual input tokens should appear relative to the text. Getting this format wrong results in degraded output quality, model errors, or completely garbled responses. Key challenges include:

  • Model-specific vision token placeholders: Every VLM family uses different special tokens to mark where image or video data should be injected. There is no universal standard.
  • Chat template structure: Models use different role markers, turn delimiters, and system prompt conventions (e.g., ChatML format, Llama format, custom formats).
  • Multi-image and multi-modal prompts: When a prompt contains multiple images or mixed image/video content, the placeholder tokens and their ordering must follow model-specific conventions.

Common vision token placeholder patterns across major VLM families:

Model Family Image Placeholder Video Placeholder
LLaVA-1.5 <image> N/A
LLaVA-NeXT <image> <video>
Qwen2-VL / Qwen2.5-VL / Qwen3-VL <|vision_start|><|image_pad|><|vision_end|> <|vision_start|><|video_pad|><|vision_end|>
Phi-3-Vision <|image_1|> N/A
InternVL <image> <video>
Gemma-3 <start_of_image> N/A
BLIP-2 (implicit, no token) N/A
Mistral/Pixtral [IMG] N/A

Usage

Use multimodal prompt formatting when:

  • Constructing prompts for any VLM inference, whether offline or online serving.
  • Switching between different VLM architectures (prompts must be reformatted).
  • Building chat-style interactions with VLMs that require multi-turn conversation templates.
  • Handling models that use HuggingFace's apply_chat_template for automatic prompt construction.

Theoretical Basis

Multimodal prompt formatting is rooted in the instruction-following paradigm of language models. VLMs are trained with specific prompt templates during instruction tuning, and deviating from these templates at inference time causes a distribution shift that degrades performance.

The vision token placeholders serve as positional anchors that tell the model's visual token injection mechanism exactly where in the token sequence to insert the projected visual features. The model's architecture typically replaces these placeholder tokens with actual visual embeddings during the forward pass, making their correct placement critical for the cross-modal attention mechanism to function properly.

Two approaches to prompt construction exist:

  1. Manual template construction: Directly building the prompt string with the correct special tokens, role markers, and placeholders. This gives full control but requires knowledge of the exact format.
  2. Tokenizer-based template application: Using AutoTokenizer.apply_chat_template() to automatically format messages according to the model's trained template. This is more robust but requires properly structured message dictionaries.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment