Principle: LLaVA Conversation Prompt Construction (Haotian Liu)
Overview
Technique for formatting multimodal prompts using conversation templates with embedded image token placeholders.
Description
Conversation prompt construction combines two subsystems:
Conversation Template System
The Conversation class manages multi-turn dialogue with:
- System message -- An optional system prompt that defines the assistant's behavior.
- Role-based message formatting -- Messages are tagged with roles (e.g., USER, ASSISTANT) according to the template.
- Separator styles -- Different LLMs expect different prompt formats, abstracted through SeparatorStyle:
  - SINGLE -- A single separator between turns (e.g., Vicuna: "USER: ... ASSISTANT: ...")
  - TWO -- Two different separators for user and assistant turns
  - MPT -- MPT-style with "<|im_start|>...<|im_end|>" tags
  - PLAIN -- No special formatting (for pretraining)
  - LLAMA_2 -- LLaMA-2 instruction format with "[INST] ... [/INST]"
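The template system above can be sketched as a small dataclass. This is a simplified illustration, not LLaVA's actual implementation: the field names mirror the repo's Conversation class, but the formatting logic is reduced to the SINGLE and TWO styles.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SeparatorStyle(Enum):
    SINGLE = auto()
    TWO = auto()


@dataclass
class Conversation:
    """Minimal sketch of a role-based conversation template."""
    system: str
    roles: tuple
    sep_style: SeparatorStyle
    sep: str = " "
    sep2: str = "</s>"
    messages: list = field(default_factory=list)

    def append_message(self, role, message):
        self.messages.append([role, message])

    def get_prompt(self):
        if self.sep_style == SeparatorStyle.SINGLE:
            ret = self.system + self.sep
            for role, msg in self.messages:
                # An empty slot ("ROLE:") cues the model to generate.
                ret += f"{role}: {msg}{self.sep}" if msg else f"{role}:"
            return ret
        if self.sep_style == SeparatorStyle.TWO:
            seps = [self.sep, self.sep2]  # alternate user/assistant separators
            ret = self.system + seps[0]
            for i, (role, msg) in enumerate(self.messages):
                ret += f"{role}: {msg}{seps[i % 2]}" if msg else f"{role}:"
            return ret
        raise NotImplementedError(self.sep_style)


conv = Conversation(
    system="A chat between a curious user and an AI assistant.",
    roles=("USER", "ASSISTANT"),
    sep_style=SeparatorStyle.TWO,
)
conv.append_message("USER", "<image>\nWhat is shown here?")
conv.append_message("ASSISTANT", None)  # open slot for the model's reply
prompt = conv.get_prompt()
```

Leaving the final assistant message empty is what produces the trailing "ASSISTANT:" that prompts generation.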
Image Token Injection
After prompt construction, tokenizer_image_token() tokenizes the prompt while replacing the <image> text placeholder with the special IMAGE_TOKEN_INDEX (-200). This sentinel value is later replaced during the model's forward pass by actual visual embeddings from the CLIP encoder and MLP projector.
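The injection step can be illustrated with a short sketch. The split-and-interleave logic follows the description above; the `tokenize` helper is a toy stand-in for a real HF tokenizer, and IMAGE_TOKEN_INDEX matches the LLaVA constant.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel; real vocab ids are non-negative


def tokenize(text):
    # Toy tokenizer (assumption for the demo): one fake id per whitespace token.
    return [hash(tok) % 32000 for tok in text.split()]


def tokenizer_image_token(prompt, image_token_index=IMAGE_TOKEN_INDEX):
    """Tokenize `prompt`, replacing each '<image>' placeholder with the sentinel id."""
    chunks = [tokenize(chunk) for chunk in prompt.split("<image>")]
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel per placeholder, resolved later in the forward pass.
            input_ids.append(image_token_index)
        input_ids.extend(chunk)
    return input_ids


ids = tokenizer_image_token("USER: <image>\nWhat is shown? ASSISTANT:")
```

Because the placeholder is spliced out before tokenization, the text "<image>" never needs to be a real vocabulary entry.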
Usage
Use whenever preparing input for LLaVA inference. The conversation mode is auto-detected from the model name:
| Model Name Contains | Conversation Mode |
|---|---|
| 'llava-v1.6' or 'llava-v1.5' | llava_v1 |
| 'llava-llama-2' | llava_llama_2 |
| 'mistral' | mistral_instruct |
| 'mpt' | mpt |
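The table above amounts to a chain of substring checks on the model name. A minimal sketch, assuming this check order and a `llava_v0` fallback (the exact ordering and fallback in the repo's eval scripts may differ):

```python
def detect_conv_mode(model_name: str) -> str:
    """Pick a conversation template from substrings of the model name."""
    name = model_name.lower()
    if "llama-2" in name:
        return "llava_llama_2"
    if "mistral" in name:
        return "mistral_instruct"
    if "v1.6" in name or "v1.5" in name:
        return "llava_v1"
    if "mpt" in name:
        return "mpt"
    return "llava_v0"  # fallback mode (assumption)


mode = detect_conv_mode("llava-v1.5-7b")
```

Checking 'llama-2' before 'v1.5'/'v1.6' matters, since a name could plausibly contain both.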
Theoretical Basis
Different LLMs expect different prompt formats. The template system abstracts this variation:
- Vicuna uses "USER: {text} ASSISTANT: {response}"
- LLaMA-2 uses "[INST] {text} [/INST] {response}"
- MPT uses "<|im_start|>user\n{text}<|im_end|><|im_start|>assistant\n{response}<|im_end|>"
The image token (IMAGE_TOKEN_INDEX = -200) is a sentinel value in the tokenized input. During the forward pass, prepare_inputs_labels_for_multimodal() detects positions with this index and replaces them with visual embeddings. This decouples prompt construction from visual encoding, allowing the same tokenization logic to work regardless of whether images are present.
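The splicing idea behind prepare_inputs_labels_for_multimodal() can be sketched in plain Python. This is an illustration of the mechanism only: `text_embed` and the toy 1-d vectors stand in for the embedding table and the CLIP-plus-projector features, which are real tensors in the actual model.

```python
IMAGE_TOKEN_INDEX = -200


def splice_visual_embeddings(input_ids, text_embed, image_features):
    """Replace each sentinel id with that image's patch embeddings.

    text_embed: callable mapping a token id -> embedding vector (stand-in
        for the model's embed_tokens lookup).
    image_features: one list of patch-embedding vectors per <image> placeholder.
    """
    out, img_idx = [], 0
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            out.extend(image_features[img_idx])  # many patch vectors per image
            img_idx += 1
        else:
            out.append(text_embed(tok))
    return out


embeds = splice_visual_embeddings(
    input_ids=[101, IMAGE_TOKEN_INDEX, 2054],
    text_embed=lambda t: [float(t)],          # toy 1-d "embedding"
    image_features=[[[0.1], [0.2], [0.3]]],   # 3 toy patch vectors for one image
)
```

Note the sequence grows: one sentinel token expands into many patch embeddings, which is why the replacement must happen at the embedding level rather than in token-id space.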
The <image> placeholder is inserted at the beginning of the first user message, before the user's text query.
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | NLP, Prompt_Engineering |
| Last Updated | 2026-02-13 14:00 GMT |