Implementation:OpenGVLab InternVL CLIPVisionTower
| Knowledge Sources | |
|---|---|
| Domains | Vision Encoder, Multimodal Models, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Unified vision tower wrapper that dispatches to CLIP, EVA-CLIP, InternViT-6B, or InternVL-14B vision encoders based on the model name, providing a consistent interface for the LLaVA architecture.
Description
CLIPVisionTower is an nn.Module that serves as the central vision encoder abstraction in the LLaVA pipeline. It detects the encoder type by inspecting the vision_tower_name string: names containing "EVA"/"eva" route to EvaCLIPVisionModel, names matching InternViT-6B patterns route to InternVisionModel, names matching InternVL-14B patterns route to InternVLModel, and all others default to the standard CLIPVisionModel. Each encoder type gets the appropriate CLIPImageProcessor configuration: InternViT/InternVL models use ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), while CLIP/EVA-CLIP use their pretrained processor settings.

The feature_select() method extracts features from a configurable hidden layer (select_layer) and supports either patch features (excluding the CLS token) or cls_patch features (all tokens). The forward pass runs images through the frozen vision tower and returns the selected features; for InternVL-14B it additionally returns query outputs alongside the image features.

The module supports delay_load for lazy initialization; until the weights are loaded, only the configuration is available via cfg_only. Properties expose hidden_size, num_patches (with InternVL-14B adding 96 query token patches), dtype, and device.
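The name-based dispatch described above can be illustrated with a small, dependency-free sketch. It is not the code in clip_encoder.py: the substring hints for InternViT-6B and InternVL-14B are hypothetical stand-ins for whatever patterns the real implementation matches, and the function returns class names as strings instead of instantiating models.

```python
# ImageNet statistics quoted in the description above.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Hypothetical substrings (assumption): the actual matching rules
# are defined in clip_encoder.py and may differ.
INTERNVIT_6B_HINTS = ("internvit-6b", "intern_vit_6b")
INTERNVL_14B_HINTS = ("internvl-14b", "internvl_14b")


def pick_encoder(vision_tower_name):
    """Return (encoder class name, whether ImageNet normalization applies)."""
    lowered = vision_tower_name.lower()
    # EVA-CLIP is checked first, mirroring the order given in the description.
    if "EVA" in vision_tower_name or "eva" in vision_tower_name:
        return "EvaCLIPVisionModel", False
    if any(hint in lowered for hint in INTERNVIT_6B_HINTS):
        return "InternVisionModel", True
    if any(hint in lowered for hint in INTERNVL_14B_HINTS):
        return "InternVLModel", True
    # Everything else falls back to the standard CLIP encoder.
    return "CLIPVisionModel", False


if __name__ == "__main__":
    for name in ("path/to/InternViT-6B-448px", "openai/clip-vit-large-patch14-336"):
        print(name, "->", pick_encoder(name))
```

The boolean in the return value indicates whether a processor for that encoder would be configured with IMAGENET_MEAN and IMAGENET_STD or keep its pretrained settings.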
Usage
Use this class as the vision backbone in any LLaVA model variant. It is instantiated by the build_vision_tower factory function and accessed via LlavaMetaModel.get_vision_tower().
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/clip_encoder.py
- Lines: 1-134
Signature
class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower, args, delay_load=False): ...
    def load_model(self): ...
    def feature_select(self, image_forward_outs): ...

    @torch.no_grad()
    def forward(self, images): ...

    @property
    def hidden_size(self): ...

    @property
    def num_patches(self): ...
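As a rough illustration of what feature_select() in the signature does, the sketch below applies the same selection logic to plain Python lists instead of tensors. It assumes that the CLS token is the first token of each sequence and that -2 is a plausible default for select_layer; both are assumptions, and the real method operates on the encoder's hidden-state tensors.

```python
def feature_select(hidden_states, select_layer=-2, select_feature="patch"):
    """Pick one layer's tokens, optionally dropping the leading CLS token.

    hidden_states: list over layers; each layer is a batch (list) of
    token sequences. Assumes the CLS token comes first in each sequence.
    """
    layer = hidden_states[select_layer]
    if select_feature == "patch":
        # Drop the first (CLS) token of every sequence in the batch.
        return [tokens[1:] for tokens in layer]
    if select_feature == "cls_patch":
        # Keep every token, CLS included.
        return [list(tokens) for tokens in layer]
    raise ValueError(f"Unexpected select feature: {select_feature}")


if __name__ == "__main__":
    # Toy data: 3 layers, batch of 1, one CLS token plus two patch tokens.
    toy = [[[f"L{i}-cls", f"L{i}-p0", f"L{i}-p1"]] for i in range(3)]
    print(feature_select(toy, select_layer=-2, select_feature="patch"))
    print(feature_select(toy, select_layer=-1, select_feature="cls_patch"))
```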
Import
from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| images | torch.Tensor or List[torch.Tensor] | Yes | Image tensors of shape (B, C, H, W) or list of (C, H, W) tensors |
Outputs
| Name | Type | Description |
|---|---|---|
| image_features | torch.Tensor or List | Extracted patch features from the selected hidden layer; for InternVL-14B returns [features, query_outputs] |
Usage Examples
Basic Usage
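from types import SimpleNamespace
import torch
from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower
# Assumed args fields (as read by upstream LLaVA); match them to your config
args = SimpleNamespace(mm_vision_select_layer=-2, mm_vision_select_feature="patch")
# Create with InternViT-6B backend; delay_load=True defers loading the weights
tower = CLIPVisionTower("path/to/InternViT-6B-448px", args, delay_load=True)
tower.load_model()
# Encode a batch of preprocessed images (random placeholder input here)
image_tensor = torch.randn(1, 3, 448, 448).to(device=tower.device, dtype=tower.dtype)
features = tower(image_tensor)  # shape: (B, num_patches, hidden_size)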
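
The num_patches dimension in the shape above can be estimated with a short calculation. This sketch assumes the patch count is the squared ratio of image size to patch size, as in upstream LLaVA, plus the 96 query tokens that the description attributes to InternVL-14B. The image and patch sizes used below are illustrative values, not read from any model config.

```python
def num_patches(image_size, patch_size, num_query_tokens=0):
    """Patch tokens on a square grid, plus any extra query tokens."""
    grid = image_size // patch_size
    return grid * grid + num_query_tokens


if __name__ == "__main__":
    # Illustrative: a 448 px input with 14 px patches gives a 32 x 32 grid.
    print(num_patches(448, 14))
    # Illustrative: a 224 px input with 14 px patches, plus 96 query tokens.
    print(num_patches(224, 14, num_query_tokens=96))
```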