Implementation:Turboderp org Exllamav2 Tokenizer Encode Multimodal

From Leeroopedia
Knowledge Sources
Domains: Vision_Language_Models, Tokenization, Multimodal
Last Updated: 2026-02-15 00:00 GMT

Overview

A concrete tool, provided by exllamav2, for encoding text prompts that contain image placeholders into token ID sequences in which each placeholder is replaced by the token IDs allocated to its multimodal embedding.

Description

The encode() method on ExLlamaV2Tokenizer supports an optional embeddings parameter that enables multimodal prompt encoding. When embeddings are provided, the method:

  1. Scans the input text for text_alias strings from each embedding container
  2. Splits the text at alias boundaries
  3. Tokenizes each text segment normally
  4. Substitutes the alias positions with the allocated token ID ranges from the corresponding ExLlamaV2MMEmbedding objects
  5. Concatenates all segments into a single token ID tensor

This produces a token ID tensor where standard text tokens coexist with special multimodal token IDs. During the model's forward pass, the multimodal token IDs are intercepted and replaced with the actual vision embeddings.
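The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the exllamav2 implementation: ToyEmbedding, toy_tokenize, and encode_with_aliases are hypothetical names, the first_index and length fields are an assumed way to describe a reserved ID range, and the toy tokenizer simply maps each character to its code point. BOS handling and special tokens are left out.

```python
from dataclasses import dataclass


@dataclass
class ToyEmbedding:
    """Hypothetical stand-in for ExLlamaV2MMEmbedding (field names are assumptions)."""
    text_alias: str   # placeholder string that appears in the prompt (non-empty)
    first_index: int  # first token ID reserved for this embedding
    length: int       # number of reserved token IDs


def toy_tokenize(text: str) -> list[int]:
    """Toy tokenizer: one ID per character (its code point). Not a real tokenizer."""
    return [ord(ch) for ch in text]


def encode_with_aliases(text: str, embeddings: list[ToyEmbedding]) -> list[int]:
    ids: list[int] = []
    pos = 0
    while pos < len(text):
        # Steps 1-2: find the earliest alias occurrence at or after `pos`.
        hits = [(text.find(e.text_alias, pos), e) for e in embeddings]
        hits = [(i, e) for i, e in hits if i != -1]
        if not hits:
            break
        start, emb = min(hits, key=lambda h: h[0])
        # Step 3: tokenize the plain text that precedes the alias.
        ids.extend(toy_tokenize(text[pos:start]))
        # Step 4: substitute the alias with its reserved ID range.
        ids.extend(range(emb.first_index, emb.first_index + emb.length))
        pos = start + len(emb.text_alias)
    # Step 5: append the trailing text segment to complete the sequence.
    ids.extend(toy_tokenize(text[pos:]))
    return ids


emb = ToyEmbedding("<image>", first_index=1_000_000, length=3)
print(encode_with_aliases("Hi <image>!", [emb]))
# [72, 105, 32, 1000000, 1000001, 1000002, 33]
```

The reserved IDs are deliberately chosen far above the toy vocabulary so that text IDs and multimodal IDs cannot collide in the sketch.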

Usage

Use this method when encoding prompts that contain image references for vision-language model inference. The embeddings parameter should contain all ExLlamaV2MMEmbedding objects whose text aliases appear in the prompt text.
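Because every alias is expected to appear in the prompt, a small pre-flight check can catch mismatches before encoding. The helper below is a hypothetical sketch (missing_aliases is not part of exllamav2); it works on plain strings, so in practice the alias list would be gathered from the text_alias attribute of each embedding.

```python
def missing_aliases(prompt: str, aliases: list[str]) -> list[str]:
    """Return the aliases that do not occur anywhere in the prompt."""
    return [alias for alias in aliases if alias not in prompt]


prompt = "Compare these images: <image_1> and <image_2>"
# With real embeddings: aliases = [e.text_alias for e in embeddings]
aliases = ["<image_1>", "<image_2>", "<image_3>"]

unused = missing_aliases(prompt, aliases)
if unused:
    print(f"Aliases not found in prompt: {unused}")
```

Whether an unused alias should raise an error or only produce a warning is a design choice left to the caller.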

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/tokenizer/tokenizer.py
  • Lines: L415-475

Signature

def encode(
    self,
    text: str,
    add_bos: bool = True,
    encode_special_tokens: bool = True,
    embeddings: list[ExLlamaV2MMEmbedding] | None = None
) -> torch.Tensor:
    ...

Import

from exllamav2 import ExLlamaV2Tokenizer
# encode is a method on ExLlamaV2Tokenizer instances

I/O Contract

Inputs

Name | Type | Required | Description
text | str | Yes | Prompt text containing image placeholder aliases (e.g., "<image>") at positions where image embeddings should be inserted
add_bos | bool | No | Whether to prepend the beginning-of-sequence token; default True
encode_special_tokens | bool | No | Whether to encode special tokens in the text; default True
embeddings | list[ExLlamaV2MMEmbedding] or None | No | List of multimodal embedding containers whose text_alias strings appear in the prompt text; None for text-only encoding

Outputs

Name | Type | Description
token_ids | torch.Tensor | Token ID tensor of shape (1, seq_len) with standard vocabulary IDs for text portions and allocated multimodal token IDs for image placeholder positions
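To illustrate how a consumer of the output can tell text IDs from multimodal IDs, the sketch below builds a boolean mask over a token ID sequence. It assumes each embedding reserves a contiguous ID range described by a (first ID, length) pair; mm_mask and the example numbers are hypothetical. A tensor of shape (1, seq_len) could be converted to a list with input_ids[0].tolist() before calling it.

```python
def mm_mask(token_ids: list[int], ranges: list[tuple[int, int]]) -> list[bool]:
    """Mark positions whose ID falls inside any reserved [first, first + length) range."""
    return [
        any(first <= tok < first + length for first, length in ranges)
        for tok in token_ids
    ]


# Illustrative values: two text IDs, a three-ID image span, one more text ID.
ids = [17, 42, 1_000_000, 1_000_001, 1_000_002, 99]
mask = mm_mask(ids, ranges=[(1_000_000, 3)])
print(mask)  # [False, False, True, True, True, False]
```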

Usage Examples

Basic

from PIL import Image
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer

# Load the model and tokenizer
model_dir = "/path/to/model"  # placeholder: directory of a vision-language model
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# Get image embeddings
# Note: upstream exllamav2 examples load the vision tower separately via
# ExLlamaV2VisionTower(config); adapt the accessor below if your setup differs.
image = Image.open("/path/to/image.jpg")
embedding = model.vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image
)

# Encode the prompt; embedding.text_alias marks where the image is inserted
prompt = f"Describe this image: {embedding.text_alias}\nWhat do you see?"
input_ids = tokenizer.encode(
    prompt,
    embeddings=[embedding]
)
# input_ids now contains multimodal token IDs at the image position

Multiple Images

# Process two images (model and tokenizer are loaded as in the Basic example)
image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")

emb1 = model.vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image1,
    text_alias="<image_1>"
)
emb2 = model.vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image2,
    text_alias="<image_2>"
)

# Use each embedding's text_alias so the prompt always matches the aliases
prompt = f"Compare these images: {emb1.text_alias} and {emb2.text_alias}"
input_ids = tokenizer.encode(
    prompt,
    embeddings=[emb1, emb2]
)

Related Pages

Implements Principle
