Implementation:Turboderp org Exllamav2 Get Image Embeddings

Knowledge Sources	ExLlamaV2
Domains	Vision_Language_Models, Image_Processing, Deep_Learning
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool for converting a PIL image into language model embeddings through the vision tower pipeline, provided by exllamav2.

Description

The get_image_embeddings() method on ExLlamaV2VisionTower takes a PIL image and processes it through the full vision pipeline: preprocessing, vision encoder forward pass, and multimodal projection. It returns an ExLlamaV2MMEmbedding container holding the resulting embeddings along with metadata needed for prompt integration.

The method handles:

Architecture-specific image preprocessing (Pixtral, Qwen2-VL, SigLIP)
Running the vision transformer forward pass on the preprocessed image tensor
Projecting features through the multimodal projector
Allocating token IDs for the embedding sequence
Generating or using a provided text alias for prompt placeholder substitution
Optionally moving embeddings to CPU for memory-efficient caching
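
The token ID allocation and alias steps in the list above are bookkeeping rather than model computation, and can be sketched in isolation. The following is purely illustrative: ToyAllocator, ToyEmbedding, make_embedding, the starting index, and the auto-generated alias format are all hypothetical stand-ins and are not taken from exllamav2's internals.

```python
from dataclasses import dataclass


class ToyAllocator:
    """Hypothetical allocator that hands out contiguous ID ranges."""

    def __init__(self, first_free_id: int = 1_000_000):
        # Start well above any real vocabulary so IDs cannot collide (assumed value).
        self.next_id = first_free_id

    def allocate(self, num_tokens: int) -> int:
        # Reserve num_tokens consecutive IDs and return the first one.
        first = self.next_id
        self.next_id += num_tokens
        return first


@dataclass
class ToyEmbedding:
    first_index: int  # first reserved token ID
    length: int       # number of embedding rows / reserved IDs
    text_alias: str   # placeholder string used in the prompt


def make_embedding(allocator, num_tokens, text_alias=None):
    first = allocator.allocate(num_tokens)
    # Use the caller's alias if given, otherwise derive one from the first ID.
    alias = text_alias if text_alias is not None else f"<$EMB_{first}$>"
    return ToyEmbedding(first, num_tokens, alias)
```

Because each image reserves its own ID range, two images in the same prompt never share token IDs, and each alias maps to exactly one range.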

Usage

Use this method for each image that needs to be included in a multimodal prompt. The returned ExLlamaV2MMEmbedding object is then passed to the tokenizer's encode method (via the embeddings parameter) and to generation jobs.
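
A minimal sketch of that hand-off to the tokenizer, written as a small helper. The name encode_with_images is hypothetical, and the add_bos and encode_special_tokens arguments are assumed settings; only the embeddings parameter is stated above. The helper also checks that every alias actually appears in the prompt, since an alias missing from the text leaves its image out of the encoded sequence.

```python
def encode_with_images(tokenizer, prompt, image_embeddings):
    """Tokenize a prompt that contains the aliases of the given embeddings."""
    # Fail early if the prompt forgot to mention one of the images.
    missing = [e.text_alias for e in image_embeddings if e.text_alias not in prompt]
    if missing:
        raise ValueError(f"prompt does not mention aliases: {missing}")

    # The tokenizer replaces each alias with that embedding's allocated token IDs.
    return tokenizer.encode(
        prompt,
        add_bos=True,                 # assumed setting
        encode_special_tokens=True,   # assumed setting
        embeddings=image_embeddings,
    )
```

The same list of embeddings would then also be supplied when creating the generation job, so the generator can resolve the allocated IDs back to the embedding tensors.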

Code Reference

Source Location

Repository: exllamav2
File: exllamav2/vlm/vision_tower.py
Lines: L344-418

Signature

def get_image_embeddings(
    self,
    model: ExLlamaV2,
    tokenizer: ExLlamaV2Tokenizer,
    image: PIL.Image.Image,
    text_alias: str | None = None,
    embeddings_cpu: bool = True
) -> ExLlamaV2MMEmbedding:
    ...

Import

from exllamav2 import ExLlamaV2VisionTower
# get_image_embeddings is a method on ExLlamaV2VisionTower instances

I/O Contract

Inputs

Name	Type	Required	Description
model	ExLlamaV2	Yes	The loaded text (language) model instance
tokenizer	ExLlamaV2Tokenizer	Yes	The tokenizer for the language model, used to allocate token IDs for the embedding sequence
image	PIL.Image.Image	Yes	The input image to process through the vision pipeline
text_alias	str or None	No	Placeholder text string to represent this image in the prompt; auto-generated if None
embeddings_cpu	bool	No	Whether to move the resulting embeddings to CPU for memory-efficient caching; default True

Outputs

Name	Type	Description
embedding	ExLlamaV2MMEmbedding	Multimodal embedding container with embeddings tensor of shape (num_tokens, hidden_size), allocated token IDs, and text_alias for prompt substitution
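
Given the (num_tokens, hidden_size) shape, the memory footprint of one cached embedding is simple arithmetic, which is why the embeddings_cpu option matters when many images are held at once. The sketch below assumes 2 bytes per element (half precision); that element size is an assumption and not stated above, and the token count and hidden size are made-up illustrative values.

```python
def embedding_bytes(num_tokens: int, hidden_size: int, bytes_per_element: int = 2) -> int:
    """Size in bytes of a (num_tokens, hidden_size) tensor."""
    return num_tokens * hidden_size * bytes_per_element


# Illustrative values only: 1024 image tokens, hidden size 4096, 2-byte elements
size = embedding_bytes(num_tokens=1024, hidden_size=4096)
print(f"{size / 2**20:.1f} MiB")  # prints "8.0 MiB"
```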

Usage Examples

Basic

from PIL import Image
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
    ExLlamaV2VisionTower,
)

model_dir = "/path/to/model"  # directory of a vision-capable model

# Load the vision tower, language model, and tokenizer from the same config
config = ExLlamaV2Config(model_dir)
vision_model = ExLlamaV2VisionTower(config)
vision_model.load()
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# Load an image and extract embeddings
image = Image.open("/path/to/image.jpg")

embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image
)

# The embedding.text_alias can now be used in prompts
print(f"Use '{embedding.text_alias}' in your prompt to reference this image")

With Custom Alias

embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image,
    text_alias="<image_1>",
    embeddings_cpu=False  # Keep on GPU for immediate use
)
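
When several images go into one prompt, giving each a distinct alias keeps the placeholders unambiguous. A small sketch of assembling such a prompt from a list of aliases follows; build_prompt is a hypothetical helper, and the plain concatenation shown ignores any model-specific chat template, which a real prompt would likely need.

```python
def build_prompt(aliases, question):
    """Place all image aliases before the question text."""
    # Duplicate aliases would make it unclear which image a placeholder refers to.
    if len(set(aliases)) != len(aliases):
        raise ValueError("aliases must be unique")
    return "".join(aliases) + "\n" + question


# One alias per image, in the order the images should appear
aliases = [f"<image_{i}>" for i in range(1, 3)]
prompt = build_prompt(aliases, "What differs between these images?")
```

Each alias here would be passed as text_alias when extracting the corresponding image's embeddings.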

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Image_Embedding_Extraction
