

Implementation:Turboderp org Exllamav2 Get Image Embeddings

From Leeroopedia
Revision as of 14:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Turboderp_org_Exllamav2_Get_Image_Embeddings.md)
Knowledge Sources
Domains: Vision_Language_Models, Image_Processing, Deep_Learning
Last Updated: 2026-02-15 00:00 GMT

Overview

A method provided by exllamav2 that converts a PIL image into embeddings for the language model by running it through the vision tower pipeline.

Description

The get_image_embeddings() method on ExLlamaV2VisionTower takes a PIL image and processes it through the full vision pipeline: preprocessing, vision encoder forward pass, and multimodal projection. It returns an ExLlamaV2MMEmbedding container holding the resulting embeddings along with metadata needed for prompt integration.

The method handles:

  • Architecture-specific image preprocessing (Pixtral, Qwen2-VL, SigLIP)
  • Running the vision transformer forward pass on the preprocessed image tensor
  • Projecting features through the multimodal projector
  • Allocating token IDs for the embedding sequence
  • Using the provided text alias for prompt placeholder substitution, or generating one if none is given
  • Optionally moving embeddings to CPU for memory-efficient caching
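The token ID allocation and alias steps above can be illustrated with a small, self-contained sketch. It assumes that embedding IDs are reserved from a counter starting just above the text vocabulary, so they can never collide with real tokens, and that an auto-generated alias is derived from the first reserved ID. The class names and the `<$EMBED_n$>` alias format are hypothetical illustrations, not the exllamav2 internals.

```python
from dataclasses import dataclass
from typing import Optional


class EmbeddingIdAllocator:
    """Hypothetical allocator that hands out token IDs above the text vocabulary."""

    def __init__(self, vocab_size: int):
        self._next = vocab_size  # first ID not used by any real text token

    def allocate(self, num_tokens: int) -> range:
        ids = range(self._next, self._next + num_tokens)
        self._next += num_tokens  # later embeddings get a disjoint range
        return ids


@dataclass
class MMEmbeddingSketch:
    """Hypothetical stand-in for the metadata carried by ExLlamaV2MMEmbedding."""
    num_tokens: int
    token_ids: range
    text_alias: str


def make_embedding(allocator: EmbeddingIdAllocator, num_tokens: int,
                   text_alias: Optional[str] = None) -> MMEmbeddingSketch:
    ids = allocator.allocate(num_tokens)
    # Use the caller's alias if given; otherwise derive one from the first ID
    # (the alias format here is an assumption for illustration only).
    alias = text_alias if text_alias is not None else f"<$EMBED_{ids.start}$>"
    return MMEmbeddingSketch(num_tokens, ids, alias)


allocator = EmbeddingIdAllocator(vocab_size=32000)
first = make_embedding(allocator, 4)
second = make_embedding(allocator, 3, text_alias="<image_1>")
print(first.text_alias, list(first.token_ids))
print(second.text_alias, list(second.token_ids))
```

Reserving IDs outside the vocabulary keeps each image's placeholder tokens unambiguous, which is what lets the same prompt contain several images.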

Usage

Call this method once for each image to be included in a multimodal prompt. Pass the returned ExLlamaV2MMEmbedding object both to the tokenizer's encode method (via the embeddings parameter) and to the generation jobs that use that prompt.
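Conceptually, the tokenizer's encode step replaces each alias in the prompt text with the token IDs reserved for that embedding, and the model later substitutes the embedding vectors at those positions. The pure-Python sketch below illustrates only the splicing step. `encode_with_aliases` and the toy character-level `encode_text` are hypothetical; with exllamav2 you would instead pass the embeddings to `tokenizer.encode` through its `embeddings` parameter.

```python
import re
from typing import Callable, Dict, List, Sequence


def encode_with_aliases(
    text: str,
    alias_to_ids: Dict[str, Sequence[int]],
    encode_text: Callable[[str], List[int]],
) -> List[int]:
    """Encode text, splicing in each embedding's reserved IDs where its alias occurs."""
    if not alias_to_ids:
        return encode_text(text)
    # Try longer aliases first so an alias that is a prefix of another cannot shadow it
    pattern = "|".join(
        re.escape(a) for a in sorted(alias_to_ids, key=len, reverse=True)
    )
    ids: List[int] = []
    pos = 0
    for match in re.finditer(pattern, text):
        ids.extend(encode_text(text[pos:match.start()]))  # plain text before the alias
        ids.extend(alias_to_ids[match.group(0)])          # reserved embedding IDs
        pos = match.end()
    ids.extend(encode_text(text[pos:]))                   # trailing plain text
    return ids


# Toy "tokenizer" that maps each character to its code point
def char_encode(s: str) -> List[int]:
    return [ord(c) for c in s]


print(encode_with_aliases("A<img>B", {"<img>": range(1000, 1003)}, char_encode))
```

Because the alias is only a placeholder string, any text that never appears naturally in prompts works as an alias, which is why a custom value such as `<image_1>` can be supplied.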

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/vlm/vision_tower.py
  • Lines: L344-418

Signature

def get_image_embeddings(
    self,
    model: ExLlamaV2,
    tokenizer: ExLlamaV2Tokenizer,
    image: PIL.Image.Image,
    text_alias: str | None = None,
    embeddings_cpu: bool = True
) -> ExLlamaV2MMEmbedding:
    ...

Import

from exllamav2 import ExLlamaV2VisionTower
# get_image_embeddings is a method on ExLlamaV2VisionTower instances

I/O Contract

Inputs

Name | Type | Required | Description
model | ExLlamaV2 | Yes | The loaded text (language) model instance
tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer for the language model, used to allocate token IDs for the embedding sequence
image | PIL.Image.Image | Yes | The input image to process through the vision pipeline
text_alias | str or None | No | Placeholder string that represents this image in the prompt; auto-generated if None
embeddings_cpu | bool | No | Whether to move the resulting embeddings to CPU for memory-efficient caching; default True
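Keeping embeddings on the CPU (`embeddings_cpu=True`) makes it cheap to hold on to them and reuse them when the same image appears in several prompts. The sketch below shows one possible caching pattern, keyed by a SHA-256 digest of the raw image bytes. `ImageEmbeddingCache` is a hypothetical helper, not part of exllamav2; in practice `compute` would be assumed to open the image and call `get_image_embeddings`.

```python
import hashlib
from typing import Any, Callable, Dict


class ImageEmbeddingCache:
    """Hypothetical cache that computes an embedding once per distinct image."""

    def __init__(self, compute: Callable[[bytes], Any]):
        self._compute = compute          # e.g. wraps get_image_embeddings (assumption)
        self._store: Dict[str, Any] = {}

    def get(self, image_bytes: bytes) -> Any:
        # Identical bytes give the same digest, so repeated images hit the cache
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = self._compute(image_bytes)
        return self._store[key]


calls = []
cache = ImageEmbeddingCache(lambda b: calls.append(b) or len(calls))
cache.get(b"image-a")
cache.get(b"image-a")  # served from the cache, compute is not called again
cache.get(b"image-b")
print(len(calls))
```

Hashing the bytes rather than the file path means a renamed copy of the same image still reuses the stored embedding.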

Outputs

Name | Type | Description
embedding | ExLlamaV2MMEmbedding | Multimodal embedding container holding an embeddings tensor of shape (num_tokens, hidden_size), the allocated token IDs, and the text_alias used for prompt substitution
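The num_tokens dimension depends on the preprocessed image size and the architecture. For a ViT-style encoder it is roughly the number of patches, possibly reduced by a spatial merge and possibly increased by per-row separator tokens. The helper below is a hypothetical back-of-the-envelope sketch under those assumptions; the example settings (14-pixel patches with a 2x2 merge, and 16-pixel patches with one break token per row) are assumed for illustration and are not read from exllamav2.

```python
def num_image_tokens(height: int, width: int, patch_size: int,
                     merge: int = 1, row_break: bool = False) -> int:
    """Estimate embedding rows for an already-resized image (hypothetical helper)."""
    step = patch_size * merge  # pixels covered by one output token along each axis
    if height % step or width % step:
        raise ValueError("image must be resized to a multiple of patch_size * merge")
    rows, cols = height // step, width // step
    # Some architectures append one separator token per row of patches
    return rows * (cols + 1) if row_break else rows * cols


print(num_image_tokens(448, 448, patch_size=14, merge=2))
print(num_image_tokens(512, 512, patch_size=16, row_break=True))
```

Since the token count grows with image area, downscaling large images before calling get_image_embeddings is one way to keep the prompt within the context length.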

Usage Examples

Basic

from PIL import Image
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2VisionTower

model_dir = "/path/to/model"  # directory of a vision-capable model

# The vision tower is a separate module built from the same config
config = ExLlamaV2Config(model_dir)
vision_model = ExLlamaV2VisionTower(config)
vision_model.load()

# Load the language model and its tokenizer
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# Load an image and extract embeddings
image = Image.open("/path/to/image.jpg")

embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image
)

# The embedding.text_alias can now be used in prompts
print(f"Use '{embedding.text_alias}' in your prompt to reference this image")

With Custom Alias

embedding = vision_model.get_image_embeddings(
    model=model,
    tokenizer=tokenizer,
    image=image,
    text_alias="<image_1>",
    embeddings_cpu=False  # Keep on GPU for immediate use
)

Related Pages

Implements Principle
