Implementation: Haotian Liu LLaVA Tokenizer Image Token
Overview
Utilities for constructing conversation prompts and tokenizing them with image token placeholders. Two components: the Conversation class for prompt formatting and tokenizer_image_token() for image-aware tokenization.
Sources
- File: llava/conversation.py, Lines: L19-208 (Conversation class), L373-392 (conv_templates)
- File: llava/mm_utils.py, Lines: L185-204 (tokenizer_image_token)
Signatures
Conversation Class
```python
@dataclasses.dataclass
class Conversation:
    system: str                 # System prompt message
    roles: List[str]            # Role names, e.g., ["USER", "ASSISTANT"]
    messages: List[List[str]]   # List of [role, message] pairs
    offset: int                 # Number of initial messages to skip in prompt
    sep_style: SeparatorStyle   # Separator style enum
    sep: str                    # Primary separator string
    sep2: str                   # Secondary separator (for TWO style)
    version: str                # Template version identifier

    def get_prompt(self) -> str:
        """Generate the full prompt string from the system message and all messages."""

    def append_message(self, role: str, message: str):
        """Append a new message with the given role."""

    def copy(self) -> 'Conversation':
        """Return a deep copy of this conversation."""
```
tokenizer_image_token
```python
def tokenizer_image_token(
    prompt: str,
    tokenizer,
    image_token_index: int = IMAGE_TOKEN_INDEX,
    return_tensors: str = None
) -> Union[List[int], torch.Tensor]:
    """
    Tokenize a prompt, replacing <image> occurrences with image_token_index.

    Args:
        prompt: The text prompt containing <image> placeholders.
        tokenizer: HuggingFace tokenizer instance.
        image_token_index: Token index to insert at image positions (default: -200).
        return_tensors: If 'pt', return a PyTorch tensor; otherwise return a list.

    Returns:
        Tokenized input_ids with image token placeholders.
    """
```
Import
```python
from llava.conversation import conv_templates, SeparatorStyle
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX
```
Inputs
| Parameter | Type | Description |
|---|---|---|
| `prompt` | `str` | Text prompt with `<image>` placeholders |
| `tokenizer` | `AutoTokenizer` | HuggingFace tokenizer for the model |
| `image_token_index` | `int` | Index to insert at image positions (default: `-200`) |
| `return_tensors` | `str` | `'pt'` for PyTorch tensor, `None` for list |
| `conv_mode` | `str` | Name of conversation template (for `conv_templates` lookup) |
Outputs
- From `Conversation.get_prompt()`: formatted prompt string with role labels, separators, and system message.
- From `tokenizer_image_token()`: tokenized `input_ids` (`List[int]` or `torch.Tensor`) where `<image>` positions are replaced with `IMAGE_TOKEN_INDEX` (`-200`).
Usage Example
```python
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN

# 1. Copy the appropriate conversation template
conv = conv_templates["llava_v1"].copy()

# 2. Append user message with image token
user_message = DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail."
conv.append_message(conv.roles[0], user_message)

# 3. Append empty assistant message (to be generated)
conv.append_message(conv.roles[1], None)

# 4. Get the formatted prompt
prompt = conv.get_prompt()
# Result: "A chat between a curious human and an artificial intelligence assistant..."
#         "USER: <image>\nDescribe this image in detail. ASSISTANT:"

# 5. Tokenize with image token placeholder
input_ids = tokenizer_image_token(
    prompt,
    tokenizer,
    IMAGE_TOKEN_INDEX,
    return_tensors='pt'
).unsqueeze(0).cuda()
# input_ids contains -200 at the position where <image> was
```
Description
Conversation class:
The `Conversation` class provides a `get_prompt()` method that formats all messages according to the configured `SeparatorStyle`. Pre-defined templates are stored in the `conv_templates` dict and accessed by name (e.g., `"llava_v1"`, `"llava_llama_2"`, `"mpt"`). Always use `.copy()` to get a fresh instance for each conversation; mutating a shared template would leak messages across conversations.
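The formatting behavior described above can be sketched with a minimal stand-in for the TWO separator style used by `"llava_v1"`-like templates. This is an illustrative simplification, not the library class: `MiniConversation` is a hypothetical name, and real templates carry additional fields (`offset`, `version`, other separator styles).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MiniConversation:
    # Sketch of the TWO separator style: user turns end with sep,
    # assistant turns end with sep2 (typically the EOS token).
    system: str
    roles: List[str]
    sep: str = " "
    sep2: str = "</s>"
    messages: List[List[Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        seps = [self.sep, self.sep2]
        ret = self.system + seps[0]
        for i, (role, message) in enumerate(self.messages):
            if message:
                ret += role + ": " + message + seps[i % 2]
            else:
                # Empty slot: end with "ROLE:" so the model generates from here.
                ret += role + ":"
        return ret

conv = MiniConversation("A chat.", ["USER", "ASSISTANT"])
conv.append_message("USER", "<image>\nDescribe this image.")
conv.append_message("ASSISTANT", None)
prompt = conv.get_prompt()
# "A chat. USER: <image>\nDescribe this image. ASSISTANT:"
```

Note how the trailing `ASSISTANT:` (with no separator) marks the position where generation begins.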
tokenizer_image_token():
This function handles the special case where `<image>` must be replaced with a single token index rather than being tokenized as text. It:
- Splits the prompt at `<image>` occurrences
- Tokenizes each text segment independently
- Inserts `IMAGE_TOKEN_INDEX` between the tokenized segments
- Optionally wraps the result in a PyTorch tensor

This approach ensures that the image placeholder does not interfere with the tokenizer's BPE encoding of the surrounding text.
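The split-and-interleave steps above can be sketched without a real HuggingFace tokenizer. This simplified version omits the BOS deduplication the library performs and the tensor conversion; `tokenize_with_image` and `toy_tokenize` are illustrative names, not the library API.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel index outside the real vocabulary

def tokenize_with_image(prompt, tokenize, image_token_index=IMAGE_TOKEN_INDEX):
    # Split on the literal placeholder, tokenize each text segment
    # independently, then rejoin with the sentinel between segments.
    chunks = [tokenize(segment) for segment in prompt.split("<image>")]
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(image_token_index)
        input_ids.extend(chunk)
    return input_ids

# Toy whitespace "tokenizer" standing in for a HuggingFace tokenizer.
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

ids = tokenize_with_image("USER: <image> describe this", toy_tokenize)
# -200 sits exactly where <image> appeared: [0, -200, 1, 2]
```

At inference time, the multimodal forward pass later locates the `-200` entries and splices in the projected image features at those positions.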
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | NLP, Prompt_Engineering |
| Last Updated | 2026-02-13 14:00 GMT |