Implementation:Turboderp org Exllamav2 Tokenizer Encode Multimodal
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Tokenization, Multimodal |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by exllamav2, for encoding text prompts that contain image placeholders into token ID sequences, with each placeholder replaced by the token IDs allocated to its multimodal embedding.
Description
The encode() method on ExLlamaV2Tokenizer supports an optional embeddings parameter that enables multimodal prompt encoding. When embeddings are provided, the method:
- Scans the input text for text_alias strings from each embedding container
- Splits the text at alias boundaries
- Tokenizes each text segment normally
- Substitutes the alias positions with the allocated token ID ranges from the corresponding ExLlamaV2MMEmbedding objects
- Concatenates all segments into a single token ID tensor
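
The steps above can be sketched in plain Python. This is a minimal illustration, not the exllamav2 implementation: `FakeMMEmbedding`, its `first_index` and `length` fields, `toy_tokenize`, and `encode_with_aliases` are hypothetical stand-ins, and the result is a plain list rather than a tensor.

```python
from dataclasses import dataclass


@dataclass
class FakeMMEmbedding:
    """Hypothetical stand-in for ExLlamaV2MMEmbedding (illustration only)."""
    text_alias: str   # placeholder string that appears in the prompt
    first_index: int  # assumed: first token ID reserved for this embedding
    length: int       # assumed: number of consecutive token IDs reserved


def toy_tokenize(segment: str) -> list[int]:
    """Toy tokenizer: one ID below 1000 per whitespace-separated word."""
    return [sum(map(ord, word)) % 1000 for word in segment.split()]


def encode_with_aliases(text: str, embeddings: list[FakeMMEmbedding]) -> list[int]:
    """Split text at alias boundaries, tokenize text parts, splice in reserved IDs."""
    ids: list[int] = []
    pos = 0
    while pos < len(text):
        # Find the earliest alias occurrence at or after the current position.
        hits = [(text.find(e.text_alias, pos), e) for e in embeddings]
        hits = [(i, e) for i, e in hits if i != -1]
        if not hits:
            ids += toy_tokenize(text[pos:])  # trailing text, no more aliases
            break
        start, emb = min(hits, key=lambda h: h[0])
        ids += toy_tokenize(text[pos:start])                         # text before the alias
        ids += range(emb.first_index, emb.first_index + emb.length)  # reserved ID range
        pos = start + len(emb.text_alias)
    return ids


emb = FakeMMEmbedding(text_alias="<image>", first_index=100_000, length=4)
token_ids = encode_with_aliases("Describe <image> please", [emb])
```

In this sketch the alias contributes four consecutive IDs starting at 100000, surrounded by one toy text ID on each side.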
This produces a token ID tensor where standard text tokens coexist with special multimodal token IDs. During the model's forward pass, the multimodal token IDs are intercepted and replaced with the actual vision embeddings.
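
The substitution during the forward pass can be pictured as an embedding lookup with two sources. The sketch below is conceptual only and assumes nothing about exllamav2 internals: `build_input_embeddings`, `text_table`, `mm_first_index`, and `mm_rows` are hypothetical names, and plain lists stand in for tensors.

```python
def build_input_embeddings(token_ids, text_table, mm_first_index, mm_rows):
    """Map each token ID to a vector: reserved IDs index mm_rows, others text_table."""
    out = []
    for t in token_ids:
        offset = t - mm_first_index
        if 0 <= offset < len(mm_rows):
            out.append(mm_rows[offset])   # ID falls in the reserved multimodal range
        else:
            out.append(text_table[t])     # ordinary vocabulary token
    return out


# Toy data: two text tokens (IDs 7 and 8) and two image rows reserved at IDs 100-101.
text_table = {7: [0.1, 0.2], 8: [0.3, 0.4]}
mm_rows = [[1.0, 1.0], [2.0, 2.0]]
hidden = build_input_embeddings([7, 100, 101, 8], text_table, 100, mm_rows)
```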
Usage
Use this method when encoding prompts that contain image references for vision-language model inference. The embeddings parameter should contain all ExLlamaV2MMEmbedding objects whose text aliases appear in the prompt text.
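
A quick pre-flight check can catch a mismatch between the prompt and the embeddings list before encoding. The helper below is a hypothetical convenience, not part of exllamav2; it only compares alias strings against the prompt text.

```python
def unused_aliases(prompt: str, aliases: list[str]) -> list[str]:
    """Return the aliases that were supplied but never appear in the prompt."""
    return [alias for alias in aliases if alias not in prompt]


prompt = "Compare these images: <image_1> and <image_2>"
aliases = ["<image_1>", "<image_2>", "<image_3>"]
unused = unused_aliases(prompt, aliases)
```

With real objects, the alias list could be built as `[e.text_alias for e in embeddings]`.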
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/tokenizer/tokenizer.py
- Lines: L415-475
Signature
def encode(
    self,
    text: str,
    add_bos: bool = True,
    encode_special_tokens: bool = True,
    embeddings: list[ExLlamaV2MMEmbedding] | None = None
) -> torch.Tensor:
    ...
Import
from exllamav2 import ExLlamaV2Tokenizer
# encode is a method on ExLlamaV2Tokenizer instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Prompt text containing image placeholder aliases (e.g., "<image>") at positions where image embeddings should be inserted |
| add_bos | bool | No | Whether to prepend the beginning-of-sequence token; default True |
| encode_special_tokens | bool | No | Whether to encode special tokens in the text; default True |
| embeddings | list[ExLlamaV2MMEmbedding] or None | No | List of multimodal embedding containers whose text_alias strings appear in the prompt text; None for text-only encoding |
Outputs
| Name | Type | Description |
|---|---|---|
| token_ids | torch.Tensor | Token ID tensor of shape (1, seq_len) with standard vocabulary IDs for text portions and allocated multimodal token IDs for image placeholder positions |
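
To inspect which positions of the output hold multimodal IDs, a mask over the single batch row is enough. This sketch assumes the reserved range is known as a start ID and a length (hypothetical parameters `first_index` and `length`) and works on a nested list, such as the result of calling `.tolist()` on the returned tensor.

```python
def multimodal_mask(token_ids, first_index, length):
    """Return one bool per position: True where the ID lies in the reserved range."""
    row = token_ids[0]  # shape (1, seq_len): take the only batch row
    return [first_index <= t < first_index + length for t in row]


# Toy output of shape (1, 5): two text IDs, then a reserved range of 3 IDs.
ids = [[12, 34, 100_000, 100_001, 100_002]]
mask = multimodal_mask(ids, first_index=100_000, length=3)
image_positions = sum(mask)  # number of sequence positions the image occupies
```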
Usage Examples
Basic
from PIL import Image
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer

# Load the model and tokenizer (model_dir is a placeholder path)
model_dir = "/path/to/model"
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# Get image embeddings
image = Image.open("/path/to/image.jpg")
embedding = model.vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image
)

# Encode prompt with image placeholder
prompt = f"Describe this image: {embedding.text_alias}\nWhat do you see?"
input_ids = tokenizer.encode(
    prompt,
    embeddings=[embedding]
)
# input_ids now contains multimodal token IDs at the image position
Multiple Images
# Process two images (model and tokenizer are loaded as in the Basic example)
image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")
emb1 = model.vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image1,
    text_alias="<image_1>"
)
emb2 = model.vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image2,
    text_alias="<image_2>"
)

# Reference each alias through its embedding so the prompt always matches
prompt = f"Compare these images: {emb1.text_alias} and {emb2.text_alias}"
input_ids = tokenizer.encode(
    prompt,
    embeddings=[emb1, emb2]
)