Implementation: Haotian Liu LLaVA Tokenizer Image Token
Overview
Utilities for constructing conversation prompts and tokenizing them with image token placeholders. Two components: the Conversation class for prompt formatting and tokenizer_image_token() for image-aware tokenization.
Sources
- File: llava/conversation.py, Lines: L19-208 (Conversation class), L373-392 (conv_templates)
- File: llava/mm_utils.py, Lines: L185-204 (tokenizer_image_token)
Signatures
Conversation Class
```python
@dataclasses.dataclass
class Conversation:
    system: str                 # System prompt message
    roles: List[str]            # Role names, e.g., ["USER", "ASSISTANT"]
    messages: List[List[str]]   # List of [role, message] pairs
    offset: int                 # Number of initial messages to skip in prompt
    sep_style: SeparatorStyle   # Separator style enum
    sep: str                    # Primary separator string
    sep2: str                   # Secondary separator (for TWO style)
    version: str                # Template version identifier

    def get_prompt(self) -> str:
        """Generate the full prompt string from the system message and all messages."""

    def append_message(self, role: str, message: str):
        """Append a new message with the given role."""

    def copy(self) -> 'Conversation':
        """Return a deep copy of this conversation."""
```
tokenizer_image_token
```python
def tokenizer_image_token(
    prompt: str,
    tokenizer,
    image_token_index: int = IMAGE_TOKEN_INDEX,
    return_tensors: str = None
) -> Union[List[int], torch.Tensor]:
    """
    Tokenize a prompt, replacing <image> occurrences with image_token_index.

    Args:
        prompt: The text prompt containing <image> placeholders.
        tokenizer: HuggingFace tokenizer instance.
        image_token_index: Token index to insert at image positions (default: -200).
        return_tensors: If 'pt', return a PyTorch tensor; otherwise return a list.

    Returns:
        Tokenized input_ids with image token placeholders.
    """
```
Import
```python
from llava.conversation import conv_templates, SeparatorStyle
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX
```
Inputs
| Parameter | Type | Description |
|---|---|---|
| `prompt` | `str` | Text prompt with `<image>` placeholders |
| `tokenizer` | `AutoTokenizer` | HuggingFace tokenizer for the model |
| `image_token_index` | `int` | Index to insert at image positions (default: `-200`) |
| `return_tensors` | `str` | `'pt'` for PyTorch tensor, `None` for list |
| `conv_mode` | `str` | Name of conversation template (for `conv_templates` lookup) |
Outputs
- From `Conversation.get_prompt()`: formatted prompt string with role labels, separators, and system message.
- From `tokenizer_image_token()`: tokenized `input_ids` (`List[int]` or `torch.Tensor`) where `<image>` positions are replaced with `IMAGE_TOKEN_INDEX` (`-200`).
Usage Example
```python
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN

# 1. Copy the appropriate conversation template
conv = conv_templates["llava_v1"].copy()

# 2. Append user message with image token
user_message = DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail."
conv.append_message(conv.roles[0], user_message)

# 3. Append empty assistant message (to be generated)
conv.append_message(conv.roles[1], None)

# 4. Get the formatted prompt
prompt = conv.get_prompt()
# Result: "A chat between a curious human and an artificial intelligence assistant..."
#         "USER: <image>\nDescribe this image in detail. ASSISTANT:"

# 5. Tokenize with image token placeholder
input_ids = tokenizer_image_token(
    prompt,
    tokenizer,
    IMAGE_TOKEN_INDEX,
    return_tensors='pt'
).unsqueeze(0).cuda()
# input_ids contains -200 at the position where <image> was
```
Description
Conversation class:
The `Conversation` class provides a `get_prompt()` method that formats all messages according to the configured `SeparatorStyle`. Pre-defined templates are stored in the `conv_templates` dict and accessed by name (e.g., `"llava_v1"`, `"llava_llama_2"`, `"mpt"`). Always use `.copy()` to get a fresh instance for each conversation; mutating a shared template would leak messages across conversations.
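The formatting behavior described above can be sketched with a minimal stand-in for the TWO separator style used by `"llava_v1"`-like templates. This is an illustrative simplification, not the library class: `MiniConversation` is a hypothetical name, and real templates carry additional fields (`offset`, `version`, other separator styles).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MiniConversation:
    # Sketch of the TWO separator style: user turns end with sep,
    # assistant turns end with sep2 (typically the EOS token).
    system: str
    roles: List[str]
    sep: str = " "
    sep2: str = "</s>"
    messages: List[List[Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        seps = [self.sep, self.sep2]
        ret = self.system + seps[0]
        for i, (role, message) in enumerate(self.messages):
            if message:
                ret += role + ": " + message + seps[i % 2]
            else:
                # Empty slot: end with "ROLE:" so the model generates from here.
                ret += role + ":"
        return ret

conv = MiniConversation("A chat.", ["USER", "ASSISTANT"])
conv.append_message("USER", "<image>\nDescribe this image.")
conv.append_message("ASSISTANT", None)
prompt = conv.get_prompt()
# "A chat. USER: <image>\nDescribe this image. ASSISTANT:"
```

Note how the trailing `ASSISTANT:` (with no separator) marks the position where generation begins.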
tokenizer_image_token():
This function handles the special case where `<image>` must be replaced with a single token index rather than being tokenized as text. It:
- Splits the prompt at `<image>` occurrences
- Tokenizes each text segment independently
- Inserts `IMAGE_TOKEN_INDEX` between the tokenized segments
- Optionally wraps the result in a PyTorch tensor

This approach ensures that the image placeholder does not interfere with the tokenizer's BPE encoding of the surrounding text.
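The split-and-interleave steps above can be sketched without a real HuggingFace tokenizer. This simplified version omits the BOS deduplication the library performs and the tensor conversion; `tokenize_with_image` and `toy_tokenize` are illustrative names, not the library API.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel index outside the real vocabulary

def tokenize_with_image(prompt, tokenize, image_token_index=IMAGE_TOKEN_INDEX):
    # Split on the literal placeholder, tokenize each text segment
    # independently, then rejoin with the sentinel between segments.
    chunks = [tokenize(segment) for segment in prompt.split("<image>")]
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(image_token_index)
        input_ids.extend(chunk)
    return input_ids

# Toy whitespace "tokenizer" standing in for a HuggingFace tokenizer.
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

ids = tokenize_with_image("USER: <image> describe this", toy_tokenize)
# -200 sits exactly where <image> appeared: [0, -200, 1, 2]
```

At inference time, the multimodal forward pass later locates the `-200` entries and splices in the projected image features at those positions.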
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | NLP, Prompt_Engineering |
| Last Updated | 2026-02-13 14:00 GMT |