Implementation:Turboderp org Exllamav2 ExLlamaV2VisionTower

Knowledge Sources	ExLlamaV2
Domains	Vision_Language_Models, Multimodal, Deep_Learning
Last Updated	2026-02-15 00:00 GMT

Overview

ExLlamaV2VisionTower is the exllamav2 class that loads and initializes the vision encoder (vision tower) used for multimodal inference.

Description

ExLlamaV2VisionTower is the class that encapsulates the vision encoder component of a vision-language model. It is constructed from an ExLlamaV2Config that contains the vision architecture parameters (detected from the model's configuration files). The constructor builds the appropriate vision encoder architecture, including patch embeddings, position encodings, transformer layers, and the multimodal projector.
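To make the patch-embedding step concrete, the following minimal sketch counts the patch tokens a ViT-style encoder would produce for one image. It assumes non-overlapping square patches and image dimensions that are exact multiples of the patch size. The function name and the example sizes are illustrative and are not taken from exllamav2.

```python
def patch_grid(height: int, width: int, patch_size: int) -> tuple[int, int]:
    """Return (rows, cols) of non-overlapping square patches covering an image.

    Assumes both dimensions are exact multiples of patch_size; a real
    preprocessor would resize or pad the image first so that this holds.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return height // patch_size, width // patch_size


rows, cols = patch_grid(height=448, width=672, patch_size=14)
num_patches = rows * cols  # one embedding vector per patch
print(rows, cols, num_patches)
```

The patch count grows with image area, which is why image size and patch size both appear among the vision parameters carried by the configuration.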

The class supports three vision encoder architectures:

Pixtral (Mistral's vision-language models)
Qwen2-VL / Qwen2.5-VL (Qwen's vision-language models)
SigLIP (used in Gemma3, LLaVA, and similar models)
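One way to picture the architecture selection is a lookup from an architecture label to a builder function. The sketch below is purely illustrative: the labels, the `VisionSpec` dataclass, and the builder names are hypothetical and do not reflect how exllamav2 actually detects or constructs the encoder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VisionSpec:
    """Hypothetical summary of the encoder that would be built."""
    family: str
    preprocessor: str


# Hypothetical builders, one per supported encoder family.
def build_pixtral() -> VisionSpec:
    return VisionSpec(family="Pixtral", preprocessor="pixtral")


def build_qwen2vl() -> VisionSpec:
    return VisionSpec(family="Qwen2-VL / Qwen2.5-VL", preprocessor="qwen2vl")


def build_siglip() -> VisionSpec:
    return VisionSpec(family="SigLIP", preprocessor="siglip")


BUILDERS: dict[str, Callable[[], VisionSpec]] = {
    "pixtral": build_pixtral,
    "qwen2vl": build_qwen2vl,
    "siglip": build_siglip,
}


def select_encoder(label: str) -> VisionSpec:
    """Pick a builder by label, failing loudly on unsupported encoders."""
    try:
        return BUILDERS[label]()
    except KeyError:
        raise ValueError(f"unsupported vision encoder: {label!r}") from None


print(select_encoder("siglip"))
```

A table-driven dispatch like this keeps the unsupported case explicit, so an unrecognized encoder fails with a clear error rather than a partially built tower.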

The vision tower is loaded as part of overall model loading: when model.load() is called on a vision-language model (VLM), the tower's weights are read from the model's safetensors files, after which the tower is ready to process images.

Usage

Use this when loading a vision-language model for multimodal inference. The vision tower is typically accessed through the model instance after loading (model.vision_model) rather than being constructed independently.

Code Reference

Source Location

Repository: exllamav2
File: exllamav2/vlm/vision_tower.py
Lines: L35-215 (__init__), load via model.py:L266-314

Signature

class ExLlamaV2VisionTower:
    def __init__(self, config: ExLlamaV2Config):
        ...

    def load(self, progress: bool = True):
        ...

Import

from exllamav2 import ExLlamaV2VisionTower

I/O Contract

Inputs

Name	Type	Required	Description
config	ExLlamaV2Config	Yes	Model configuration containing vision architecture parameters (encoder type, hidden size, number of layers, patch size, image size, projector config)
progress	bool	No	Whether to display a progress bar during weight loading; default True

Outputs

Name	Type	Description
vision_tower	ExLlamaV2VisionTower	Loaded vision tower instance with patch embeddings, position embeddings, vision attention/MLP layers, and multimodal projector ready for image processing
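Read together with the signature, the contract above suggests the following explicit construction path. This is a sketch rather than a verified recipe: it uses only the constructor, `load(progress=...)`, and the import shown on this page, and it assumes exllamav2 is installed. `load_vision_tower` is a hypothetical helper name. The import sits inside the function so that defining the helper has no side effects.

```python
def load_vision_tower(model_dir: str, progress: bool = True):
    """Build and load a vision tower from a model directory (hypothetical helper)."""
    # Imported lazily so this module can be imported without exllamav2 present.
    from exllamav2 import ExLlamaV2Config, ExLlamaV2VisionTower

    config = ExLlamaV2Config(model_dir)          # carries the vision parameters
    vision_tower = ExLlamaV2VisionTower(config)  # builds the encoder from config
    vision_tower.load(progress=progress)         # loads the weights
    return vision_tower
```

Calling `load_vision_tower("/path/to/model")` would require a real vision-language model directory; the path here is a placeholder.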

Dependencies

torch - Tensor operations and GPU computation
PIL - Image loading and manipulation
exllamav2.conv - Convolution operations for patch embedding
exllamav2.vlm.mmprojector - Multimodal projector implementations
exllamav2.vlm.processor.* - Architecture-specific image preprocessors (Pixtral, Qwen2VL, SigLIP)
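Before loading a model it can be useful to confirm that the top-level packages behind these dependencies are importable. The following is a minimal sketch using only the standard library; the helper name is hypothetical, and it checks only top-level package names (PIL is provided by the Pillow distribution).

```python
import importlib.util

# Top-level packages behind the dependencies listed above.
REQUIRED = ("torch", "PIL", "exllamav2")


def missing_packages(names=REQUIRED) -> list[str]:
    """Return the names that cannot be found on the current import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All vision-tower dependencies are importable")
```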

Usage Examples

Basic

from exllamav2 import ExLlamaV2, ExLlamaV2Config

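# Placeholder path: point this at the directory holding the model files
model_dir = "/path/to/model"

# Load a vision-language model (vision tower loads automatically)
config = ExLlamaV2Config(model_dir)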
model = ExLlamaV2(config)
model.load()

# Access the vision tower
vision_model = model.vision_model

Checking Vision Support

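# Check if the loaded model has vision capabilities.
# getattr avoids an AttributeError if the attribute is absent on a text-only model.
if getattr(model, "vision_model", None) is not None: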
    print("Model supports vision input")
    # Proceed with image processing
else:
    print("Text-only model")

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Vision_Tower_Loading
