Principle:Vllm project Vllm Multimodal Prompt Formatting

Knowledge Sources	vLLM HuggingFace Transformers
Domains	Prompt Engineering, Vision Language Models, Tokenization
Last Updated	2026-02-08 13:00 GMT

Overview

Constructing correctly formatted prompts with model-specific vision token placeholders is essential for vision-language models to properly associate visual inputs with textual instructions.

Description

Each vision-language model family defines its own prompt format that specifies where and how visual input tokens should appear relative to the text. Getting this format wrong results in degraded output quality, model errors, or completely garbled responses. Key challenges include:

Model-specific vision token placeholders: Every VLM family uses different special tokens to mark where image or video data should be injected. There is no universal standard.
Chat template structure: Models use different role markers, turn delimiters, and system prompt conventions (e.g., ChatML format, Llama format, custom formats).
Multi-image and multi-modal prompts: When a prompt contains multiple images or mixed image/video content, the placeholder tokens and their ordering must follow model-specific conventions.

Common vision token placeholder patterns across major VLM families:

Model Family	Image Placeholder	Video Placeholder
LLaVA-1.5	`<image>`	N/A
LLaVA-NeXT	`<image>`	`<video>`
Qwen2-VL / Qwen2.5-VL / Qwen3-VL	`<\|vision_start\|><\|image_pad\|><\|vision_end\|>`	`<\|vision_start\|><\|video_pad\|><\|vision_end\|>`
Phi-3-Vision	`<\|image_1\|>`	N/A
InternVL	`<image>`	`<video>`
Gemma-3	`<start_of_image>`	N/A
BLIP-2	(implicit, no token)	N/A
Mistral/Pixtral	`[IMG]`	N/A

Usage

Use multimodal prompt formatting when:

Constructing prompts for any VLM inference, whether offline or online serving.
Switching between different VLM architectures (prompts must be reformatted).
Building chat-style interactions with VLMs that require multi-turn conversation templates.
Handling models that use HuggingFace's apply_chat_template for automatic prompt construction.

Theoretical Basis

Multimodal prompt formatting is rooted in the instruction-following paradigm of language models. VLMs are trained with specific prompt templates during instruction tuning, and deviating from these templates at inference time causes a distribution shift that degrades performance.

The vision token placeholders serve as positional anchors that tell the model's visual token injection mechanism exactly where in the token sequence to insert the projected visual features. The model's architecture typically replaces these placeholder tokens with actual visual embeddings during the forward pass, making their correct placement critical for the cross-modal attention mechanism to function properly.

Two approaches to prompt construction exist:

Manual template construction: Directly building the prompt string with the correct special tokens, role markers, and placeholders. This gives full control but requires knowledge of the exact format.
Tokenizer-based template application: Using AutoTokenizer.apply_chat_template() to automatically format messages according to the model's trained template. This is more robust but requires properly structured message dictionaries.

Related Pages

Implemented By

Implementation:Vllm_project_Vllm_VLM_Prompt_Template

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment