Implementation:Turboderp org Exllamav2 Tokenizer Encode Multimodal

Knowledge Sources	ExLlamaV2
Domains	Vision_Language_Models, Tokenization, Multimodal
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool for encoding text prompts containing image placeholders into token ID sequences with multimodal embedding substitution, provided by exllamav2.

Description

The encode() method on ExLlamaV2Tokenizer supports an optional embeddings parameter that enables multimodal prompt encoding. When embeddings are provided, the method:

1. Scans the input text for text_alias strings from each embedding container
2. Splits the text at alias boundaries
3. Tokenizes each text segment normally
4. Substitutes the alias positions with the allocated token ID ranges from the corresponding ExLlamaV2MMEmbedding objects
5. Concatenates all segments into a single token ID tensor

This produces a token ID tensor where standard text tokens coexist with special multimodal token IDs. During the model's forward pass, the multimodal token IDs are intercepted and replaced with the actual vision embeddings.
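
The steps above can be sketched in plain Python. This is a minimal illustration of the split-and-substitute idea, not the library's actual code: FakeMMEmbedding, toy_tokenize, and encode_with_aliases are hypothetical names, the one-ID-per-character tokenizer stands in for a real one, and the first_index/length fields are assumed to mirror how an embedding container describes its allocated ID range.

```python
import re
from dataclasses import dataclass


@dataclass
class FakeMMEmbedding:
    """Hypothetical stand-in for ExLlamaV2MMEmbedding (illustration only)."""
    text_alias: str   # placeholder string that appears in the prompt
    first_index: int  # first token ID allocated to this embedding (assumed field)
    length: int       # number of token IDs allocated (assumed field)

    def get_ids(self) -> list[int]:
        return list(range(self.first_index, self.first_index + self.length))


def toy_tokenize(segment: str) -> list[int]:
    """Toy tokenizer: one ID per character. A real tokenizer would go here."""
    return [ord(ch) for ch in segment]


def encode_with_aliases(text: str, embeddings: list[FakeMMEmbedding]) -> list[int]:
    aliases = {e.text_alias: e for e in embeddings}
    if not aliases:
        return toy_tokenize(text)
    # Longest aliases first, so "<image_10>" is not matched as "<image_1>" + "0".
    ordered = sorted(aliases, key=len, reverse=True)
    # The capturing group makes re.split keep the aliases in its result.
    pattern = "(" + "|".join(re.escape(a) for a in ordered) + ")"
    ids: list[int] = []
    for part in re.split(pattern, text):
        if part in aliases:
            ids.extend(aliases[part].get_ids())  # substitute the allocated ID range
        elif part:
            ids.extend(toy_tokenize(part))       # ordinary text segment
    return ids


image = FakeMMEmbedding(text_alias="<image>", first_index=1_000_000, length=3)
token_ids = encode_with_aliases("A<image>B", [image])
print(token_ids)  # [65, 1000000, 1000001, 1000002, 66]
```

The allocated IDs sit well outside the toy vocabulary here, which is what lets a later stage tell them apart from ordinary text tokens.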

Usage

Use this method when encoding prompts that contain image references for vision-language model inference. The embeddings parameter should contain all ExLlamaV2MMEmbedding objects whose text aliases appear in the prompt text.
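
Before encoding, it can help to confirm that every alias you plan to pass actually occurs in the prompt. The check below is a small sketch using only string operations; alias_counts is a hypothetical helper, not part of exllamav2. In real use the alias list would presumably be built from the embeddings, for example [e.text_alias for e in embeddings].

```python
def alias_counts(prompt: str, aliases: list[str]) -> dict[str, int]:
    """Count how often each alias string occurs in the prompt."""
    return {alias: prompt.count(alias) for alias in aliases}


prompt = "Compare these images: <image_1> and <image_2>"
counts = alias_counts(prompt, ["<image_1>", "<image_2>", "<image_3>"])

# Aliases with a count of zero have an embedding but no placeholder in the prompt.
unused = [alias for alias, n in counts.items() if n == 0]
print(counts)  # {'<image_1>': 1, '<image_2>': 1, '<image_3>': 0}
print(unused)  # ['<image_3>']
```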

Code Reference

Source Location

Repository: exllamav2
File: exllamav2/tokenizer/tokenizer.py
Lines: L415-475

Signature

def encode(
    self,
    text: str,
    add_bos: bool = False,
    add_eos: bool = False,
    encode_special_tokens: bool = False,
    return_offsets: bool = False,
    embeddings: list[ExLlamaV2MMEmbedding] | None = None
) -> torch.Tensor:
    ...

Import

from exllamav2 import ExLlamaV2Tokenizer
# encode is a method on ExLlamaV2Tokenizer instances

I/O Contract

Inputs

Name	Type	Required	Description
text	str	Yes	Prompt text containing image placeholder aliases (e.g., "<image>") at positions where image embeddings should be inserted
add_bos	bool	No	Whether to prepend the beginning-of-sequence token; default False
add_eos	bool	No	Whether to append the end-of-sequence token; default False
encode_special_tokens	bool	No	Whether special tokens that appear in the text are encoded as special tokens rather than as plain text; default False
return_offsets	bool	No	Whether to also return a tensor of padding offsets alongside the token IDs; default False
embeddings	list[ExLlamaV2MMEmbedding] or None	No	List of multimodal embedding containers whose text_alias strings appear in the prompt text; None for text-only encoding

Outputs

Name	Type	Description
token_ids	torch.Tensor	Token ID tensor of shape (1, seq_len) with standard vocabulary IDs for text portions and allocated multimodal token IDs for image placeholder positions
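
To see where an image landed in the output, one option is to search the token ID row for the contiguous range allocated to an embedding. The sketch below works on a plain Python list (for a tensor, something like input_ids[0].tolist() would produce one). find_mm_span is a hypothetical helper, and first_index and length are assumed to describe the embedding's allocated ID range.

```python
from typing import List, Optional, Tuple


def find_mm_span(
    token_ids: List[int], first_index: int, length: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) positions of the allocated ID range, or None if absent."""
    expected = list(range(first_index, first_index + length))
    for start in range(len(token_ids) - length + 1):
        if token_ids[start:start + length] == expected:
            return start, start + length
    return None


# Toy row: text tokens around three multimodal IDs starting at 1_000_000.
row = [1, 500, 501, 1_000_000, 1_000_001, 1_000_002, 502]
span = find_mm_span(row, first_index=1_000_000, length=3)
print(span)  # (3, 6)
```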

Usage Examples

Basic

from PIL import Image
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
    ExLlamaV2VisionTower,
)

# Load the config, vision tower, language model, and tokenizer
model_dir = "/path/to/model"  # placeholder path to a vision-language model
config = ExLlamaV2Config(model_dir)
vision_model = ExLlamaV2VisionTower(config)
vision_model.load()
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# Get image embeddings from the vision tower
image = Image.open("/path/to/image.jpg")
embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image
)

# Encode prompt with image placeholder; BOS and special-token handling are set explicitly
prompt = f"Describe this image: {embedding.text_alias}\nWhat do you see?"
input_ids = tokenizer.encode(
    prompt,
    add_bos=True,
    encode_special_tokens=True,
    embeddings=[embedding]
)
# input_ids now contains multimodal token IDs at the image position

Multiple Images

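# Process two images, giving each embedding a distinct text alias.
# vision_model is a loaded ExLlamaV2VisionTower; model and tokenizer are loaded
# as in the Basic example.
image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")

emb1 = vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image1,
    text_alias="<image_1>"
)
emb2 = vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image2,
    text_alias="<image_2>"
)

# Build the prompt from the aliases so it always matches the embeddings
prompt = f"Compare these images: {emb1.text_alias} and {emb2.text_alias}"
input_ids = tokenizer.encode(
    prompt,
    add_bos=True,
    encode_special_tokens=True,
    embeddings=[emb1, emb2]
)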

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Multimodal_Prompt_Encoding
