Implementation:Vllm project Vllm VLM Prompt Template

Knowledge Sources	vLLM vLLM Docs HuggingFace Transformers
Domains	Prompt Engineering, Vision Language Models, Tokenization
Last Updated	2026-02-08 13:00 GMT

Overview

Concrete tool for constructing VLM-compatible prompts with vision token placeholders, provided by HuggingFace Transformers' chat template system and vLLM's example prompt patterns.

Description

VLM prompt formatting in vLLM uses two primary approaches:

Direct string construction: Building prompt strings manually with the exact special tokens, role markers, and vision placeholders for each model. This approach is used by most runner functions in vision_language.py.
AutoTokenizer.apply_chat_template(): Using HuggingFace's tokenizer-based template system to automatically format structured message lists into the correct prompt string. Several models (InternVL, Eagle2.5, H2OVL, Ovis, etc.) use this approach.

The apply_chat_template method accepts a list of message dictionaries with role and content keys, applies the model's Jinja2-based chat template, and returns the formatted prompt. Passing tokenize=False returns the prompt as a string rather than a list of token IDs, and add_generation_prompt=True appends the marker that opens the assistant turn so the model generates a reply instead of continuing the user message.
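To make the effect of add_generation_prompt concrete, the following is a minimal pure-Python sketch of what a ChatML-style template might produce. The function render_chatml is a hypothetical stand-in written for illustration; it is not part of Transformers or vLLM, and real chat templates are Jinja2 strings shipped with each model's tokenizer, so their exact output may differ.

```python
# Hypothetical stand-in for a ChatML-style chat template (illustration only).
# Real templates are Jinja2 strings bundled with each model's tokenizer.
def render_chatml(messages, add_generation_prompt=False):
    """Format a list of role/content dicts into a ChatML-style prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from this point.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "<image>\nDescribe this image."}]
prompt = render_chatml(messages, add_generation_prompt=True)
print(prompt)
```

Without add_generation_prompt, the sketch stops after the closing marker of the last user message, which is the form typically wanted when formatting a finished conversation rather than requesting a new reply.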

Usage

Use VLM prompt templates when:

Formatting prompts for any VLM inference with vLLM.
Adapting prompts when switching between model families.
Building multi-turn visual conversations with chat-style VLMs.
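When switching between model families, one possible arrangement is to keep the per-model patterns in a single lookup so that only the family key changes at the call site. The sketch below assumes the single-image patterns listed on this page; the family keys and the build_prompt helper are hypothetical names chosen for illustration, not vLLM identifiers.

```python
# Hypothetical registry of single-image prompt patterns, keyed by labels
# invented for this sketch. The patterns themselves are the ones shown in
# the usage examples on this page.
PROMPT_BUILDERS = {
    "llava-1.5": lambda q: f"USER: <image>\n{q}\nASSISTANT:",
    "phi-3.5-vision": lambda q: f"<|user|>\n<|image_1|>\n{q}<|end|>\n<|assistant|>\n",
    "pixtral": lambda q: f"<s>[INST]{q}\n[IMG][/INST]",
}


def build_prompt(family, question):
    """Return the prompt string for a known family, or raise for unknown ones."""
    try:
        builder = PROMPT_BUILDERS[family]
    except KeyError:
        raise ValueError(f"No prompt pattern registered for {family!r}") from None
    return builder(question)


question = "What is the content of this image?"
llava_prompt = build_prompt("llava-1.5", question)
pixtral_prompt = build_prompt("pixtral", question)
```

Failing loudly on an unknown family is a deliberate choice here: a prompt built with the wrong placeholder token tends to produce silently degraded output rather than an error.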

Code Reference

Source Location

Repository: vllm
File: examples/offline_inference/vision_language.py (prompt patterns per model)
External: transformers.AutoTokenizer.apply_chat_template()

Signature

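# Using the tokenizer's chat template (preferred for models that support it).
# apply_chat_template is called on a loaded tokenizer instance, e.g. the object
# returned by AutoTokenizer.from_pretrained(...). Only the parameters used on
# this page are shown.
tokenizer.apply_chat_template(
    conversation: list[dict[str, str]],  # [{"role": "user", "content": "..."}]
    tokenize: bool = True,               # Set False for string output
    add_generation_prompt: bool = False,  # Set True to add assistant marker
) -> str | list[int]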

# Direct string construction pattern (model-specific)
prompt = f"<role_marker>{vision_placeholder}\n{question}<end_marker>"

Import

from transformers import AutoTokenizer

I/O Contract

Inputs

Name	Type	Required	Description
conversation	`list[dict[str, str]]`	Yes	List of message dicts with `"role"` and `"content"` keys; content includes vision placeholder tokens
tokenize	`bool`	No	If `False`, returns string instead of token IDs (default: `True`)
add_generation_prompt	`bool`	No	If `True`, appends assistant turn marker (default: `False`)

Outputs

Name	Type	Description
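prompt	`str`	Fully formatted prompt string with vision placeholders, role markers, and turn delimiters (returned when `tokenize=False`; with the default `tokenize=True` the method returns token IDs instead)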

Usage Examples

LLaVA-1.5: Direct String Construction

# LLaVA-1.5 uses a simple USER/ASSISTANT format with <image> token
question = "What is the content of this image?"
prompt = f"USER: <image>\n{question}\nASSISTANT:"

# Result: "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

Qwen2.5-VL: Direct ChatML-style Construction

# Qwen2.5-VL uses ChatML format with vision_start/vision_end wrappers
question = "Describe this image."
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

Phi-3.5-Vision: Numbered Image Tokens

# Phi-3.5-Vision uses numbered image tokens: <|image_1|>, <|image_2|>, etc.
question = "What is shown in this image?"
prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
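The numbered tokens extend naturally to several images. The sketch below assumes one placeholder per image, numbered from 1 and separated by newlines, placed ahead of the question; the helper name build_phi_multi_image_prompt is hypothetical, and the exact layout should be checked against the model card or the vLLM multi-image example before relying on it.

```python
def build_phi_multi_image_prompt(question, num_images):
    """Build a Phi-3.5-Vision style prompt with one numbered token per image.

    Assumption: placeholders are numbered from 1 and joined by newlines.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    placeholders = "\n".join(
        f"<|image_{i}|>" for i in range(1, num_images + 1)
    )
    return f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"


two_image_prompt = build_phi_multi_image_prompt("Compare these images.", 2)
print(two_image_prompt)
```

With num_images=1 the helper reduces to the single-image pattern shown above.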

InternVL: Using apply_chat_template

from transformers import AutoTokenizer

model_name = "OpenGVLab/InternVL3-2B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

question = "Describe the content of this image in detail."
messages = [{"role": "user", "content": f"<image>\n{question}"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Stop tokens for InternVL: map each token string to its vocabulary ID
stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in stop_tokens]
# Drop tokens that the tokenizer could not resolve to an ID
stop_token_ids = [tid for tid in stop_token_ids if tid is not None]

Mistral/Pixtral: INST Format

# Mistral-based VLMs use [INST] format with [IMG] token
question = "What do you see in this image?"
prompt = f"<s>[INST]{question}\n[IMG][/INST]"

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Multimodal_Prompt_Formatting
