Principle:Vllm project Vllm Multimodal Prompt Formatting
| Knowledge Sources | |
|---|---|
| Domains | Prompt Engineering, Vision Language Models, Tokenization |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Constructing correctly formatted prompts with model-specific vision token placeholders is essential for vision-language models to properly associate visual inputs with textual instructions.
Description
Each vision-language model family defines its own prompt format that specifies where and how visual input tokens should appear relative to the text. Getting this format wrong results in degraded output quality, model errors, or completely garbled responses. Key challenges include:
- Model-specific vision token placeholders: Every VLM family uses different special tokens to mark where image or video data should be injected. There is no universal standard.
- Chat template structure: Models use different role markers, turn delimiters, and system prompt conventions (e.g., ChatML format, Llama format, custom formats).
- Multi-image and multi-modal prompts: When a prompt contains multiple images or mixed image/video content, the placeholder tokens and their ordering must follow model-specific conventions.
Common vision token placeholder patterns across major VLM families:
| Model Family | Image Placeholder | Video Placeholder |
|---|---|---|
| LLaVA-1.5 | <image> |
N/A |
| LLaVA-NeXT | <image> |
<video>
|
| Qwen2-VL / Qwen2.5-VL / Qwen3-VL | <|vision_start|><|image_pad|><|vision_end|> |
<|vision_start|><|video_pad|><|vision_end|>
|
| Phi-3-Vision | <|image_1|> |
N/A |
| InternVL | <image> |
<video>
|
| Gemma-3 | <start_of_image> |
N/A |
| BLIP-2 | (implicit, no token) | N/A |
| Mistral/Pixtral | [IMG] |
N/A |
Usage
Use multimodal prompt formatting when:
- Constructing prompts for any VLM inference, whether offline or online serving.
- Switching between different VLM architectures (prompts must be reformatted).
- Building chat-style interactions with VLMs that require multi-turn conversation templates.
- Handling models that use HuggingFace's
apply_chat_templatefor automatic prompt construction.
Theoretical Basis
Multimodal prompt formatting is rooted in the instruction-following paradigm of language models. VLMs are trained with specific prompt templates during instruction tuning, and deviating from these templates at inference time causes a distribution shift that degrades performance.
The vision token placeholders serve as positional anchors that tell the model's visual token injection mechanism exactly where in the token sequence to insert the projected visual features. The model's architecture typically replaces these placeholder tokens with actual visual embeddings during the forward pass, making their correct placement critical for the cross-modal attention mechanism to function properly.
Two approaches to prompt construction exist:
- Manual template construction: Directly building the prompt string with the correct special tokens, role markers, and placeholders. This gives full control but requires knowledge of the exact format.
- Tokenizer-based template application: Using
AutoTokenizer.apply_chat_template()to automatically format messages according to the model's trained template. This is more robust but requires properly structured message dictionaries.