Implementation:Vllm project Vllm VLM Prompt Template
| Knowledge Sources | |
|---|---|
| Domains | Prompt Engineering, Vision Language Models, Tokenization |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool for constructing VLM-compatible prompts with vision token placeholders, provided by HuggingFace Transformers' chat template system and vLLM's example prompt patterns.
Description
VLM prompt formatting in vLLM uses two primary approaches:
- Direct string construction: Building prompt strings manually with the exact special tokens, role markers, and vision placeholders for each model. This approach is used by most runner functions in vision_language.py.
- AutoTokenizer.apply_chat_template(): Using HuggingFace's tokenizer-based template system to automatically format structured message lists into the correct prompt string. Several models (InternVL, Eagle2.5, H2OVL, Ovis, etc.) use this approach.
The apply_chat_template method accepts a list of message dictionaries with role and content keys, applies the model's Jinja2-based chat template, and returns the formatted prompt. Passing tokenize=False returns the prompt as a string rather than as token IDs, and add_generation_prompt=True appends the marker that opens the assistant turn.
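To make the effect of add_generation_prompt concrete, the sketch below emulates a ChatML-style template in plain Python. It is only an illustration: real chat templates are Jinja2 programs shipped with each model and can differ in many details, and `render_chatml` is a hypothetical helper, not part of Transformers or vLLM.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical stand-in for a ChatML-style chat template (not the library code)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "<image>\nDescribe this image."}]
prompt = render_chatml(messages, add_generation_prompt=True)
without_marker = render_chatml(messages)
```

Without the generation prompt the string ends after the user turn, which is why inference code normally sets add_generation_prompt=True.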
Usage
Use VLM prompt templates when:
- Formatting prompts for any VLM inference with vLLM.
- Adapting prompts when switching between model families.
- Building multi-turn visual conversations with chat-style VLMs.
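When switching between model families, one option is to keep the per-model prompt patterns in a single lookup table. The following is a minimal sketch under that assumption; `PROMPT_TEMPLATES` and `build_prompt` are hypothetical names, and the template strings are the ones shown in the usage examples on this page.

```python
# Hypothetical registry of per-family prompt patterns (taken from the examples on this page).
PROMPT_TEMPLATES = {
    "llava-1.5": "USER: <image>\n{question}\nASSISTANT:",
    "phi-3.5-vision": "<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n",
    "pixtral-hf": "<s>[INST]{question}\n[IMG][/INST]",
}


def build_prompt(family, question):
    """Fill the question into the prompt pattern registered for a model family."""
    try:
        template = PROMPT_TEMPLATES[family]
    except KeyError:
        raise ValueError(f"No prompt template registered for {family!r}") from None
    return template.format(question=question)


llava_prompt = build_prompt("llava-1.5", "What is the content of this image?")
pixtral_prompt = build_prompt("pixtral-hf", "What do you see in this image?")
```

Keeping the patterns in one place means that adapting to a new family only requires adding an entry, while the calling code stays unchanged.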
Code Reference
Source Location
- Repository: vllm
- File: examples/offline_inference/vision_language.py (prompt patterns per model)
- External: transformers.AutoTokenizer.apply_chat_template()
Signature
# Using AutoTokenizer chat template (preferred for models that support it)
# Simplified signature; call it on the tokenizer instance returned by AutoTokenizer.from_pretrained()
AutoTokenizer.apply_chat_template(
    conversation: list[dict[str, str]],   # [{"role": "user", "content": "..."}]
    tokenize: bool = True,                # Set False for string output
    add_generation_prompt: bool = False,  # Set True to add assistant marker
) -> str | list[int]
# Direct string construction pattern (model-specific)
prompt = f"<role_marker>{vision_placeholder}\n{question}<end_marker>"
Import
from transformers import AutoTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| conversation | list[dict[str, str]] | Yes | List of message dicts with "role" and "content" keys; content includes vision placeholder tokens |
| tokenize | bool | No | If False, returns string instead of token IDs (default: True) |
| add_generation_prompt | bool | No | If True, appends assistant turn marker (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| prompt | str | Fully formatted prompt string with vision placeholders, role markers, and turn delimiters |
Usage Examples
LLaVA-1.5: Direct String Construction
# LLaVA-1.5 uses a simple USER/ASSISTANT format with <image> token
question = "What is the content of this image?"
prompt = f"USER: <image>\n{question}\nASSISTANT:"
# Result: "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
Qwen2.5-VL: Direct ChatML-style Construction
# Qwen2.5-VL uses ChatML format with vision_start/vision_end wrappers
question = "Describe this image."
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
Phi-3.5-Vision: Numbered Image Tokens
# Phi-3.5-Vision uses numbered image tokens: <|image_1|>, <|image_2|>, etc.
question = "What is shown in this image?"
prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
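For prompts that reference several images, the numbered tokens can be generated rather than typed by hand. The sketch below assumes the single-image layout above extends to one numbered token per line for each additional image; `build_phi_prompt` is a hypothetical helper, and the exact multi-image layout should be checked against the model's documentation.

```python
def build_phi_prompt(question, num_images=1):
    """Build a Phi-style prompt with one numbered image token per image (assumed layout)."""
    # Tokens are numbered from 1: <|image_1|>, <|image_2|>, ...
    placeholders = "\n".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"


single = build_phi_prompt("What is shown in this image?")
double = build_phi_prompt("What differs between these images?", num_images=2)
```

With num_images=1 the helper reproduces the single-image prompt shown above.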
InternVL: Using apply_chat_template
from transformers import AutoTokenizer
model_name = "OpenGVLab/InternVL3-2B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
question = "Describe the content of this image in detail."
messages = [{"role": "user", "content": f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Stop tokens for InternVL
stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in stop_tokens]
stop_token_ids = [tid for tid in stop_token_ids if tid is not None]
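For a multi-turn conversation, the same message list can be extended with the assistant's reply and the next user turn before it is formatted again. This is a rough sketch: `extend_conversation` is a hypothetical helper, the assistant reply is placeholder text, and keeping the image placeholder only in the first user turn is an assumption that may not hold for every model.

```python
def extend_conversation(messages, assistant_reply, follow_up):
    """Return a new message list with the reply and the next user question appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": follow_up},
    ]


first_turn = [
    {"role": "user", "content": "<image>\nDescribe the content of this image in detail."}
]
# Placeholder reply text, used only to illustrate the structure.
conversation = extend_conversation(
    first_turn, "A dog running on a beach.", "What breed is the dog?"
)
# The extended list would then be formatted as before, for example:
# tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
```

Returning a new list instead of mutating the input keeps earlier turns reusable, for instance when branching a conversation with different follow-up questions.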
Mistral/Pixtral: INST Format
# Mistral-based VLMs use [INST] format with [IMG] token
question = "What do you see in this image?"
prompt = f"<s>[INST]{question}\n[IMG][/INST]"