Implementation:Turboderp org Exllamav2 Get Image Embeddings
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Image_Processing, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for converting a PIL image into language model embeddings through the vision tower pipeline, provided by exllamav2.
Description
The get_image_embeddings() method on ExLlamaV2VisionTower takes a PIL image and processes it through the full vision pipeline: preprocessing, vision encoder forward pass, and multimodal projection. It returns an ExLlamaV2MMEmbedding container holding the resulting embeddings along with metadata needed for prompt integration.
The method handles:
- Architecture-specific image preprocessing (Pixtral, Qwen2-VL, SigLIP)
- Running the vision transformer forward pass on the preprocessed image tensor
- Projecting features through the multimodal projector
- Allocating token IDs for the embedding sequence
- Generating or using a provided text alias for prompt placeholder substitution
- Optionally moving embeddings to CPU for memory-efficient caching
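The stages listed above can be pictured as a short chain of function calls. The sketch below is a toy illustration in plain Python, not the exllamav2 implementation: `preprocess`, `encode`, `project`, `ToyEmbedding`, the ID counter, and the alias format are all hypothetical stand-ins, and nested lists take the place of tensors.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical counters; assumes embedding IDs start above the text vocabulary.
_next_id = count(1_000_000)
_next_alias = count(0)


@dataclass
class ToyEmbedding:
    """Stand-in for the returned container (not the real ExLlamaV2MMEmbedding)."""
    vectors: list      # num_tokens rows of hidden_size floats
    token_ids: list    # one allocated ID per row
    text_alias: str    # placeholder string used in the prompt
    device: str        # "cpu" or "gpu" in this toy model


def preprocess(pixels, patch):
    # Split a flat list of pixel values into fixed-size patches.
    return [pixels[i:i + patch] for i in range(0, len(pixels), patch)]


def encode(patches):
    # Toy "vision encoder": one feature (the mean) per patch.
    return [[sum(p) / len(p)] for p in patches]


def project(features, hidden_size):
    # Toy "projector": widen each feature vector to hidden_size entries.
    return [f * hidden_size for f in features]


def toy_get_image_embeddings(pixels, text_alias=None, embeddings_cpu=True):
    vectors = project(encode(preprocess(pixels, patch=4)), hidden_size=3)
    token_ids = [next(_next_id) for _ in vectors]
    if text_alias is None:
        text_alias = f"<$EMB_{next(_next_alias)}$>"  # hypothetical alias format
    device = "cpu" if embeddings_cpu else "gpu"
    return ToyEmbedding(vectors, token_ids, text_alias, device)


emb = toy_get_image_embeddings(list(range(8)))
print(len(emb.vectors), len(emb.vectors[0]), emb.token_ids, emb.device)
```

The point of the sketch is the ordering and the bookkeeping: each projected row gets its own token ID, and the alias and device choice travel with the vectors in one container.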
Usage
Call this method once for each image to be included in a multimodal prompt. The returned ExLlamaV2MMEmbedding objects are then passed, as a list, to the tokenizer's encode method (via the embeddings parameter) and to generation jobs. Each image's text_alias must appear in the prompt text at the position where that image belongs.
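To make the placeholder mechanism concrete, the following toy sketch shows one way an encoder could splice an embedding's allocated token IDs into a token sequence wherever its alias appears. It assumes nothing about the internals of ExLlamaV2Tokenizer: `ToyEmbedding`, `encode_text`, and `encode_with_embeddings` are hypothetical names, and a trivial one-ID-per-character encoding stands in for a real vocabulary.

```python
from dataclasses import dataclass


@dataclass
class ToyEmbedding:
    text_alias: str   # placeholder string that appears in the prompt
    token_ids: list   # IDs allocated for this embedding's rows


def encode_text(text):
    # Toy text encoding: one ID per character.
    return [ord(ch) for ch in text]


def encode_with_embeddings(prompt, embeddings):
    """Replace each alias in the prompt with that embedding's token IDs."""
    ids, pos = [], 0
    while pos < len(prompt):
        # Find the earliest alias occurrence at or after pos.
        hits = [(prompt.find(e.text_alias, pos), e) for e in embeddings]
        hits = [(i, e) for i, e in hits if i != -1]
        if not hits:
            break
        start, emb = min(hits, key=lambda h: h[0])
        ids += encode_text(prompt[pos:start]) + emb.token_ids
        pos = start + len(emb.text_alias)
    # Encode whatever plain text remains after the last alias.
    return ids + encode_text(prompt[pos:])


img = ToyEmbedding(text_alias="<image_1>", token_ids=[900, 901, 902])
tokens = encode_with_embeddings("A<image_1>B", [img])
print(tokens)  # [65, 900, 901, 902, 66]
```

In this toy model a prompt that never mentions the alias encodes as ordinary text, which is why the alias has to be written into the prompt for the image to take part in generation.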
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/vlm/vision_tower.py
- Lines: L344-418
Signature
def get_image_embeddings(
    self,
    model: ExLlamaV2,
    tokenizer: ExLlamaV2Tokenizer,
    image: PIL.Image.Image,
    text_alias: str | None = None,
    embeddings_cpu: bool = True
) -> ExLlamaV2MMEmbedding:
    ...
Import
from exllamav2 import ExLlamaV2VisionTower
# get_image_embeddings is a method on ExLlamaV2VisionTower instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The loaded text (language) model instance |
| tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer for the language model, used to allocate token IDs for the embedding sequence |
| image | PIL.Image.Image | Yes | The input image to process through the vision pipeline |
| text_alias | str or None | No | Placeholder text string to represent this image in the prompt; auto-generated if None |
| embeddings_cpu | bool | No | Whether to move the resulting embeddings to CPU for memory-efficient caching; default True |
Outputs
| Name | Type | Description |
|---|---|---|
| embedding | ExLlamaV2MMEmbedding | Multimodal embedding container with embeddings tensor of shape (num_tokens, hidden_size), allocated token IDs, and text_alias for prompt substitution |
Usage Examples
Basic
from PIL import Image
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2VisionTower
# Directory of a vision-capable model (placeholder path)
model_dir = "/path/to/model"
config = ExLlamaV2Config(model_dir)
# The vision tower is a separate object built from the same config
vision_model = ExLlamaV2VisionTower(config)
vision_model.load()
# Load the language model and its tokenizer
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
# Load an image and extract embeddings
image = Image.open("/path/to/image.jpg")
embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image
)
# The embedding.text_alias can now be used in prompts
print(f"Use '{embedding.text_alias}' in your prompt to reference this image")
With Custom Alias
embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image,
    text_alias="<image_1>",
    embeddings_cpu=False  # Keep on GPU for immediate use
)
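Because embeddings_cpu=True is described as a way to cache embeddings cheaply, a small cache keyed on image content is one way to avoid re-running the vision tower for a repeated image. This is a minimal sketch, assuming a caller-supplied `embed_fn` that wraps the call to get_image_embeddings; `EmbeddingCache`, `embed_fn`, and `fake_embed` are hypothetical names, not part of exllamav2.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by a hash of the raw image bytes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn   # e.g. a wrapper around get_image_embeddings
        self._store = {}

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Cache miss: run the (expensive) embedding function once.
            self._store[key] = self._embed_fn(image_bytes)
        return self._store[key]


calls = []


def fake_embed(image_bytes):
    # Stand-in for the expensive vision pass; records each invocation.
    calls.append(len(image_bytes))
    return {"num_bytes": len(image_bytes)}


cache = EmbeddingCache(fake_embed)
first = cache.get(b"same image")
second = cache.get(b"same image")
print(first is second, len(calls))
```

Hashing the bytes rather than the file path means that two copies of the same image share one entry, at the cost of reading each file once to compute its key.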