Principle: LLaVA Conversation Prompt Construction (Haotian Liu)
Overview
Technique for formatting multimodal prompts using conversation templates with embedded image token placeholders.
Description
Conversation prompt construction combines two subsystems:
Conversation Template System
The Conversation class manages multi-turn dialogue with:
- System message -- An optional system prompt that defines the assistant's behavior.
- Role-based message formatting -- Messages are tagged with roles (e.g., USER, ASSISTANT) according to the template.
- Separator styles -- Different LLMs expect different prompt formats, abstracted through SeparatorStyle:
  - SINGLE -- A single separator between turns (e.g., Vicuna: "USER: ... ASSISTANT: ...")
  - TWO -- Two different separators for user and assistant turns
  - MPT -- MPT-style with "<|im_start|>...<|im_end|>" tags
  - PLAIN -- No special formatting (for pretraining)
  - LLAMA_2 -- LLaMA-2 instruction format with "[INST] ... [/INST]"
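The template system above can be sketched as a small dataclass. This is a simplified illustration, not LLaVA's actual implementation: the field names mirror the repo's Conversation class, but the formatting logic is reduced to the SINGLE and TWO styles.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SeparatorStyle(Enum):
    SINGLE = auto()
    TWO = auto()


@dataclass
class Conversation:
    """Minimal sketch of a role-based conversation template."""
    system: str
    roles: tuple
    sep_style: SeparatorStyle
    sep: str = " "
    sep2: str = "</s>"
    messages: list = field(default_factory=list)

    def append_message(self, role, message):
        self.messages.append([role, message])

    def get_prompt(self):
        if self.sep_style == SeparatorStyle.SINGLE:
            ret = self.system + self.sep
            for role, msg in self.messages:
                # An empty slot ("ROLE:") cues the model to generate.
                ret += f"{role}: {msg}{self.sep}" if msg else f"{role}:"
            return ret
        if self.sep_style == SeparatorStyle.TWO:
            seps = [self.sep, self.sep2]  # alternate user/assistant separators
            ret = self.system + seps[0]
            for i, (role, msg) in enumerate(self.messages):
                ret += f"{role}: {msg}{seps[i % 2]}" if msg else f"{role}:"
            return ret
        raise NotImplementedError(self.sep_style)


conv = Conversation(
    system="A chat between a curious user and an AI assistant.",
    roles=("USER", "ASSISTANT"),
    sep_style=SeparatorStyle.TWO,
)
conv.append_message("USER", "<image>\nWhat is shown here?")
conv.append_message("ASSISTANT", None)  # open slot for the model's reply
prompt = conv.get_prompt()
```

Leaving the final assistant message empty is what produces the trailing "ASSISTANT:" that prompts generation.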
Image Token Injection
After prompt construction, tokenizer_image_token() tokenizes the prompt while replacing the <image> text placeholder with the special IMAGE_TOKEN_INDEX (-200). This sentinel value is later replaced during the model's forward pass by actual visual embeddings from the CLIP encoder and MLP projector.
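The injection step can be illustrated with a short sketch. The split-and-interleave logic follows the description above; the `tokenize` helper is a toy stand-in for a real HF tokenizer, and IMAGE_TOKEN_INDEX matches the LLaVA constant.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel; real vocab ids are non-negative


def tokenize(text):
    # Toy tokenizer (assumption for the demo): one fake id per whitespace token.
    return [hash(tok) % 32000 for tok in text.split()]


def tokenizer_image_token(prompt, image_token_index=IMAGE_TOKEN_INDEX):
    """Tokenize `prompt`, replacing each '<image>' placeholder with the sentinel id."""
    chunks = [tokenize(chunk) for chunk in prompt.split("<image>")]
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel per placeholder, resolved later in the forward pass.
            input_ids.append(image_token_index)
        input_ids.extend(chunk)
    return input_ids


ids = tokenizer_image_token("USER: <image>\nWhat is shown? ASSISTANT:")
```

Because the placeholder is spliced out before tokenization, the text "<image>" never needs to be a real vocabulary entry.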
Usage
Use whenever preparing input for LLaVA inference. The conversation mode is auto-detected from the model name:
| Model Name Contains | Conversation Mode |
|---|---|
| 'llava-v1.6' or 'llava-v1.5' | llava_v1 |
| 'llava-llama-2' | llava_llama_2 |
| 'mistral' | mistral_instruct |
| 'mpt' | mpt |
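The table above amounts to a chain of substring checks on the model name. A minimal sketch, assuming this check order and a `llava_v0` fallback (the exact ordering and fallback in the repo's eval scripts may differ):

```python
def detect_conv_mode(model_name: str) -> str:
    """Pick a conversation template from substrings of the model name."""
    name = model_name.lower()
    if "llama-2" in name:
        return "llava_llama_2"
    if "mistral" in name:
        return "mistral_instruct"
    if "v1.6" in name or "v1.5" in name:
        return "llava_v1"
    if "mpt" in name:
        return "mpt"
    return "llava_v0"  # fallback mode (assumption)


mode = detect_conv_mode("llava-v1.5-7b")
```

Checking 'llama-2' before 'v1.5'/'v1.6' matters, since a name could plausibly contain both.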
Theoretical Basis
Different LLMs expect different prompt formats. The template system abstracts this variation:
- Vicuna uses "USER: {text} ASSISTANT: {response}"
- LLaMA-2 uses "[INST] {text} [/INST] {response}"
- MPT uses "<|im_start|>user\n{text}<|im_end|><|im_start|>assistant\n{response}<|im_end|>"
The image token (IMAGE_TOKEN_INDEX = -200) is a sentinel value in the tokenized input. During the forward pass, prepare_inputs_labels_for_multimodal() detects positions with this index and replaces them with visual embeddings. This decouples prompt construction from visual encoding, allowing the same tokenization logic to work regardless of whether images are present.
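The splicing idea behind prepare_inputs_labels_for_multimodal() can be sketched in plain Python. This is an illustration of the mechanism only: `text_embed` and the toy 1-d vectors stand in for the embedding table and the CLIP-plus-projector features, which are real tensors in the actual model.

```python
IMAGE_TOKEN_INDEX = -200


def splice_visual_embeddings(input_ids, text_embed, image_features):
    """Replace each sentinel id with that image's patch embeddings.

    text_embed: callable mapping a token id -> embedding vector (stand-in
        for the model's embed_tokens lookup).
    image_features: one list of patch-embedding vectors per <image> placeholder.
    """
    out, img_idx = [], 0
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            out.extend(image_features[img_idx])  # many patch vectors per image
            img_idx += 1
        else:
            out.append(text_embed(tok))
    return out


embeds = splice_visual_embeddings(
    input_ids=[101, IMAGE_TOKEN_INDEX, 2054],
    text_embed=lambda t: [float(t)],          # toy 1-d "embedding"
    image_features=[[[0.1], [0.2], [0.3]]],   # 3 toy patch vectors for one image
)
```

Note the sequence grows: one sentinel token expands into many patch embeddings, which is why the replacement must happen at the embedding level rather than in token-id space.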
The <image> placeholder is inserted at the beginning of the first user message, before the user's text query.
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | NLP, Prompt_Engineering |
| Last Updated | 2026-02-13 14:00 GMT |