

Implementation:Vllm project Vllm VLM Prompt Template

From Leeroopedia


Knowledge Sources
Domains: Prompt Engineering, Vision Language Models, Tokenization
Last Updated: 2026-02-08 13:00 GMT

Overview

Concrete techniques for constructing vision-language model (VLM) prompts that contain vision token placeholders, based on HuggingFace Transformers' chat template system and the prompt patterns in vLLM's examples.

Description

VLM prompt formatting in vLLM uses two primary approaches:

  1. Direct string construction: Building prompt strings manually with the exact special tokens, role markers, and vision placeholders for each model. This approach is used by most runner functions in vision_language.py.
  2. AutoTokenizer.apply_chat_template(): Using HuggingFace's tokenizer-based template system to automatically format structured message lists into the correct prompt string. Several models (InternVL, Eagle2.5, H2OVL, Ovis, etc.) use this approach.

The apply_chat_template method accepts a list of message dictionaries with role and content keys, applies the model's Jinja2-based chat template, and returns the formatted prompt. Passing tokenize=False returns the prompt as a string rather than as token IDs, and add_generation_prompt=True appends the assistant turn marker so that the model starts generating its reply.
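To make the effect of add_generation_prompt concrete, here is a minimal sketch that imitates a ChatML-style template in plain Python. The render_chatml function is a hypothetical stand-in, not the Transformers implementation; real chat templates are Jinja2 strings shipped with each tokenizer, and their exact markers vary by model.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Rough stand-in for a ChatML-style chat template (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in role markers, as a ChatML template would do.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "<image>\nDescribe this image."}]
prompt_open = render_chatml(messages, add_generation_prompt=True)
prompt_closed = render_chatml(messages, add_generation_prompt=False)
```

Here prompt_open equals prompt_closed plus the trailing assistant marker, which is the difference the add_generation_prompt flag is described as making.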

Usage

Use VLM prompt templates when:

  • Formatting prompts for any VLM inference with vLLM.
  • Adapting prompts when switching between model families.
  • Building multi-turn visual conversations with chat-style VLMs.
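One way to keep prompt formats manageable when switching between model families is a small lookup table of builder functions. The sketch below reuses the single-image patterns documented on this page; the family keys, PROMPT_BUILDERS, and build_prompt are hypothetical names chosen for illustration and are not part of vLLM.

```python
from typing import Callable


def llava_15(question: str) -> str:
    return f"USER: <image>\n{question}\nASSISTANT:"


def phi35_vision(question: str) -> str:
    return f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"


def pixtral_hf(question: str) -> str:
    return f"<s>[INST]{question}\n[IMG][/INST]"


# Hypothetical family labels mapped to their prompt builders.
PROMPT_BUILDERS: dict[str, Callable[[str], str]] = {
    "llava-1.5": llava_15,
    "phi-3.5-vision": phi35_vision,
    "pixtral-hf": pixtral_hf,
}


def build_prompt(family: str, question: str) -> str:
    """Look up the builder for a model family and format the question."""
    try:
        builder = PROMPT_BUILDERS[family]
    except KeyError:
        raise ValueError(f"no prompt builder registered for {family!r}") from None
    return builder(question)


example = build_prompt("llava-1.5", "What is the content of this image?")
```

Keeping each format in its own function means that a change of model only changes the key passed to build_prompt, while the question text stays the same.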

Code Reference

Source Location

  • Repository: vllm
  • File: examples/offline_inference/vision_language.py (prompt patterns per model)
  • External: transformers.AutoTokenizer.apply_chat_template()

Signature

# Using the tokenizer's chat template (preferred for models that support it).
# Simplified signature; call it on an instance from AutoTokenizer.from_pretrained().
tokenizer.apply_chat_template(
    conversation: list[dict[str, str]],  # [{"role": "user", "content": "..."}]
    tokenize: bool = True,               # Set False for string output
    add_generation_prompt: bool = False,  # Set True to add assistant marker
) -> str | list[int]

# Direct string construction pattern (model-specific)
prompt = f"<role_marker>{vision_placeholder}\n{question}<end_marker>"

Import

from transformers import AutoTokenizer

I/O Contract

Inputs

Name | Type | Required | Description
conversation | list[dict[str, str]] | Yes | List of message dicts with "role" and "content" keys; content includes vision placeholder tokens
tokenize | bool | No | If False, returns string instead of token IDs (default: True)
add_generation_prompt | bool | No | If True, appends assistant turn marker (default: False)

Outputs

Name | Type | Description
prompt | str | Fully formatted prompt string with vision placeholders, role markers, and turn delimiters (returned when tokenize=False; with tokenize=True the method returns token IDs instead)

Usage Examples

LLaVA-1.5: Direct String Construction

# LLaVA-1.5 uses a simple USER/ASSISTANT format with <image> token
question = "What is the content of this image?"
prompt = f"USER: <image>\n{question}\nASSISTANT:"

# Result: "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
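A formatted prompt is normally submitted together with the image it refers to. The sketch below pairs the LLaVA-1.5 prompt with an image in the dict-style input that vLLM accepts ("prompt" plus "multi_modal_data"). The make_llava_request helper and its placeholder count check are hypothetical additions for illustration, and a plain object stands in for a PIL image so the snippet runs without extra dependencies.

```python
from typing import Any


def make_llava_request(question: str, image: Any) -> dict[str, Any]:
    """Pair a LLaVA-1.5 prompt with its image in a dict-style vLLM input."""
    prompt = f"USER: <image>\n{question}\nASSISTANT:"
    # Hypothetical sanity check: one <image> placeholder for the one image.
    if prompt.count("<image>") != 1:
        raise ValueError("expected exactly one <image> placeholder")
    return {"prompt": prompt, "multi_modal_data": {"image": image}}


# Stand-in for a PIL.Image.Image (assumption: real code would load an image).
fake_image = object()
request = make_llava_request("What is the content of this image?", fake_image)

# With vLLM installed and a model loaded, the request would be passed on as:
# outputs = llm.generate(request, sampling_params)
```

The count check guards against a question that itself contains the placeholder text, which would otherwise leave the prompt with more placeholders than images.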

Qwen2.5-VL: Direct ChatML-style Construction

# Qwen2.5-VL uses ChatML format with vision_start/vision_end wrappers
question = "Describe this image."
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

Phi-3.5-Vision: Numbered Image Tokens

# Phi-3.5-Vision uses numbered image tokens: <|image_1|>, <|image_2|>, etc.
question = "What is shown in this image?"
prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
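The numbering scheme extends naturally to several images. The following is a minimal sketch, assuming one placeholder per line and numbering that starts at 1; the phi3v_prompt helper is a hypothetical name, and the multi-image layout should be checked against the model's own documentation before use.

```python
def phi3v_prompt(question: str, num_images: int = 1) -> str:
    """Build a Phi-3.5-Vision style prompt with numbered image placeholders."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # One numbered placeholder per image: <|image_1|>, <|image_2|>, ...
    placeholders = "\n".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"


single = phi3v_prompt("What is shown in this image?")
double = phi3v_prompt("What differs between these images?", num_images=2)
```

With num_images=1 the helper reproduces the single-image prompt shown above; with two images, both placeholders appear before the question.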

InternVL: Using apply_chat_template

from transformers import AutoTokenizer

model_name = "OpenGVLab/InternVL3-2B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

question = "Describe the content of this image in detail."
messages = [{"role": "user", "content": f"<image>\n{question}"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Stop tokens for InternVL; keep only those that map to a token ID
stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in stop_tokens]
stop_token_ids = [tid for tid in stop_token_ids if tid is not None]

Mistral/Pixtral: INST Format

# Mistral-based VLMs use [INST] format with [IMG] token
question = "What do you see in this image?"
prompt = f"<s>[INST]{question}\n[IMG][/INST]"

Related Pages

Implements Principle
