Implementation:Turboderp org Exllamav2 ExLlamaV2VisionTower
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Multimodal, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by the exllamav2 library, for loading and initializing the vision encoder (vision tower) used in multimodal inference.
Description
ExLlamaV2VisionTower is the class that encapsulates the vision encoder component of a vision-language model. It is constructed from an ExLlamaV2Config that contains the vision architecture parameters (detected from the model's configuration files). The constructor builds the appropriate vision encoder architecture, including patch embeddings, position encodings, transformer layers, and the multimodal projector.
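As a rough illustration of what "detected from the model's configuration files" can mean, the sketch below checks whether a Hugging Face style config.json carries a vision_config block. The helper name has_vision_config and the reliance on a vision_config key are assumptions made for illustration; the detection logic inside ExLlamaV2Config may well differ.

```python
import json
from pathlib import Path


def has_vision_config(model_dir: str) -> bool:
    """Return True if <model_dir>/config.json holds a "vision_config" object.

    Hypothetical helper: it only inspects the JSON file and never calls
    into exllamav2.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return False
    with config_path.open("r", encoding="utf-8") as f:
        data = json.load(f)
    # Guard against a config.json whose top level is not a JSON object.
    if not isinstance(data, dict):
        return False
    return isinstance(data.get("vision_config"), dict)
```

A directory without config.json, or one whose config has no such block, is reported as text-only by this check.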
The class supports three vision encoder architectures:
- Pixtral (Mistral's vision-language models)
- Qwen2-VL / Qwen2.5-VL (Qwen's vision-language models)
- SigLIP (used in Gemma3, LLaVA, and similar models)
The vision tower is loaded as part of the overall model loading process. When model.load() is called on a VLM, the vision tower weights are read from the model's .safetensors files and the vision tower is made ready for image processing.
Usage
Use this when loading a vision-language model for multimodal inference. The vision tower is typically accessed through the model instance after loading (model.vision_model) rather than being constructed independently.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/vlm/vision_tower.py
- Lines: L35-215 (__init__), load via model.py:L266-314
Signature
class ExLlamaV2VisionTower:
    def __init__(self, config: ExLlamaV2Config):
        ...
    def load(self, progress: bool = True):
        ...
Import
from exllamav2 import ExLlamaV2VisionTower
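The signature above suggests that the tower can also be built and loaded explicitly. A minimal sketch under that assumption follows. The names load_vision_tower and tower_cls are hypothetical and introduced here; the tower_cls parameter exists only so the helper can be exercised without a model on disk. Only the constructor and load(progress=...) shown in the signature are used.

```python
from typing import Any, Optional


def load_vision_tower(
    config: Any,
    progress: bool = True,
    tower_cls: Optional[type] = None,
) -> Any:
    """Construct a vision tower from `config` and load its weights.

    Hypothetical wrapper around the documented constructor and load() call.
    """
    if tower_cls is None:
        # Imported lazily so this module can be imported without exllamav2.
        from exllamav2 import ExLlamaV2VisionTower
        tower_cls = ExLlamaV2VisionTower
    tower = tower_cls(config)
    tower.load(progress=progress)
    return tower
```

With exllamav2 installed, this would presumably be called as load_vision_tower(ExLlamaV2Config(model_dir)), where model_dir is a placeholder for a vision-language model directory.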
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | ExLlamaV2Config | Yes | Model configuration containing vision architecture parameters (encoder type, hidden size, number of layers, patch size, image size, projector config) |
| progress | bool | No | Whether to display a progress bar during weight loading; default True |
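The patch size and image size parameters determine how many patches the encoder sees per image. The arithmetic below is the generic ViT-style calculation, assuming square patches and image dimensions that divide evenly. patch_grid is a hypothetical helper, and the 896 and 14 values are illustrative inputs rather than values read from any particular config.

```python
from typing import Tuple


def patch_grid(height: int, width: int, patch_size: int) -> Tuple[int, int]:
    """Return (rows, cols) of non-overlapping square patches covering an image."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return height // patch_size, width // patch_size


rows, cols = patch_grid(896, 896, 14)
print(rows, cols, rows * cols)  # 64 64 4096
```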
Outputs
| Name | Type | Description |
|---|---|---|
| vision_tower | ExLlamaV2VisionTower | Loaded vision tower instance with patch embeddings, position embeddings, vision attention/MLP layers, and multimodal projector ready for image processing |
Dependencies
- torch - Tensor operations and GPU computation
- PIL - Image loading and manipulation
- exllamav2.conv - Convolution operations for patch embedding
- exllamav2.vlm.mmprojector - Multimodal projector implementations
- exllamav2.vlm.processor.* - Architecture-specific image preprocessors (Pixtral, Qwen2VL, SigLIP)
Usage Examples
Basic
from exllamav2 import ExLlamaV2, ExLlamaV2Config
# Placeholder path to a local vision-language model directory
model_dir = "/path/to/vlm"
# Load a vision-language model (vision tower loads automatically)
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
# Access the vision tower
vision_model = model.vision_model
Checking Vision Support
# Check if the loaded model has vision capabilities
if model.vision_model is not None:
    print("Model supports vision input")
    # Proceed with image processing
else:
    print("Text-only model")
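If the same code must also handle model objects that do not define a vision_model attribute at all, a defensive variant of the check can use getattr. This is a sketch only: supports_vision is a hypothetical helper, not part of exllamav2, and the SimpleNamespace objects are stand-ins for loaded models.

```python
from types import SimpleNamespace
from typing import Any


def supports_vision(model: Any) -> bool:
    """True if `model` exposes a non-None `vision_model` attribute."""
    return getattr(model, "vision_model", None) is not None


# Stand-in objects used purely to demonstrate the check.
text_only = SimpleNamespace()
with_tower = SimpleNamespace(vision_model=object())

print(supports_vision(text_only))   # False
print(supports_vision(with_tower))  # True
```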