Implementation:OpenGVLab InternVL CLIPVisionTower

Knowledge Sources	OpenGVLab_InternVL
Domains	Vision Encoder, Multimodal Models, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

Unified vision tower wrapper that dispatches to CLIP, EVA-CLIP, InternViT-6B, or InternVL-14B vision encoders based on the model name, providing a consistent interface for the LLaVA architecture.

Description

CLIPVisionTower is an nn.Module that serves as the central vision encoder abstraction in the LLaVA pipeline. It selects the encoder by inspecting the vision_tower_name string: names containing "EVA" or "eva" route to EvaCLIPVisionModel, names matching the InternViT-6B patterns route to InternVisionModel, names matching the InternVL-14B patterns route to InternVLModel, and all other names fall back to the standard CLIPVisionModel.

Each encoder type gets a matching CLIPImageProcessor configuration. InternViT and InternVL models use ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), while CLIP and EVA-CLIP use the settings of their pretrained processors.

The feature_select() method extracts features from a configurable hidden layer (select_layer) and supports two modes: patch, which drops the CLS token, and cls_patch, which keeps all tokens. The forward pass runs images through the frozen vision tower and returns the selected features; for InternVL-14B it also returns the query outputs alongside the image features.

The module supports delay_load for lazy initialization; until the weights are loaded, only the configuration is available, via cfg_only. Properties expose hidden_size, num_patches (InternVL-14B adds 96 query tokens to the patch count), dtype, and device.
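
The name-based routing described above can be sketched in plain Python. This is an illustration only, not the code from clip_encoder.py: the helper names (pick_encoder, uses_imagenet_normalization), the substring lists for the InternViT-6B and InternVL-14B patterns, and the order in which the checks run are all assumptions made for the example.

```python
# Hypothetical substring hints; the real pattern lists live in clip_encoder.py.
INTERN_VIT_6B_HINTS = ("InternViT-6B", "intern_vit_6b")
INTERNVL_14B_HINTS = ("InternVL-14B", "internvl_14b")

# ImageNet normalization constants quoted in the description.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def pick_encoder(vision_tower_name: str) -> str:
    """Return the encoder class name that the description pairs with a model name."""
    if "EVA" in vision_tower_name or "eva" in vision_tower_name:
        return "EvaCLIPVisionModel"
    if any(hint in vision_tower_name for hint in INTERN_VIT_6B_HINTS):
        return "InternVisionModel"
    if any(hint in vision_tower_name for hint in INTERNVL_14B_HINTS):
        return "InternVLModel"
    # Anything unrecognized falls back to the standard CLIP encoder.
    return "CLIPVisionModel"


def uses_imagenet_normalization(vision_tower_name: str) -> bool:
    """InternViT/InternVL use ImageNet mean/std; CLIP and EVA-CLIP keep their own."""
    return pick_encoder(vision_tower_name) in {"InternVisionModel", "InternVLModel"}
```

Because the decision depends only on substrings of the name, renaming a checkpoint directory can silently change which encoder is loaded, so keeping the original model name in the path is the safer choice.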

Usage

Use this class as the vision backbone in any LLaVA model variant. It is instantiated by the build_vision_tower factory function and accessed via LlavaMetaModel.get_vision_tower().

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_encoder/clip_encoder.py
Lines: 1-134

Signature

class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower, args, delay_load=False): ...
    def load_model(self): ...
    def feature_select(self, image_forward_outs): ...
    @torch.no_grad()
    def forward(self, images): ...
    @property
    def hidden_size(self): ...
    @property
    def num_patches(self): ...

Import

from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower

I/O Contract

Inputs

Name	Type	Required	Description
images	torch.Tensor or List[torch.Tensor]	Yes	Image tensors of shape (B, C, H, W) or list of (C, H, W) tensors

Outputs

Name	Type	Description
image_features	torch.Tensor or List	Extracted patch features from the selected hidden layer; for InternVL-14B returns [features, query_outputs]
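
The layer and token selection behind the returned features can be sketched with nested Python lists standing in for tensors. This is a simplified illustration, not the method from the repository: it assumes the CLS token sits at index 0 of each token sequence, and the default select_layer of -2 is only an example value.

```python
def feature_select(hidden_states, select_layer=-2, select_feature="patch"):
    """Pick one hidden layer, then keep or drop the leading CLS token.

    hidden_states holds one entry per layer; each entry is a batch of
    token sequences, with the CLS token assumed to be at index 0.
    """
    layer = hidden_states[select_layer]
    if select_feature == "patch":
        # Drop the CLS token, keep only patch tokens.
        return [tokens[1:] for tokens in layer]
    if select_feature == "cls_patch":
        # Keep every token, CLS included.
        return [list(tokens) for tokens in layer]
    raise ValueError(f"Unexpected select feature: {select_feature}")
```

With real tensors of shape (B, tokens, hidden_size), the patch branch corresponds to slicing off the first position along the token dimension.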

Usage Examples

Basic Usage

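from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower

# `args` is the configuration object carrying the vision settings (such as select_layer);
# `image_tensor` is a preprocessed batch of shape (B, C, H, W).

# Create with the InternViT-6B backend, deferring weight loading
tower = CLIPVisionTower("path/to/InternViT-6B-448px", args, delay_load=True)
tower.load_model()  # explicit load is needed because delay_load=True

# Encode images
features = tower(image_tensor)  # shape: (B, num_patches, hidden_size)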
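
The delay_load behaviour mentioned in the description follows a common lazy-initialization pattern, sketched below with a stand-in class. LazyTower, its loader callbacks, and the is_loaded flag are hypothetical names chosen for this example; only delay_load, load_model, and cfg_only come from the description.

```python
class LazyTower:
    """Hypothetical stand-in for the delay_load pattern, not the real class."""

    def __init__(self, name, load_config, load_weights, delay_load=False):
        self.name = name
        self.is_loaded = False
        self.model = None
        self.cfg_only = None
        self._load_weights = load_weights
        if delay_load:
            # Cheap path: fetch only the configuration, no weights yet.
            self.cfg_only = load_config(name)
        else:
            self.load_model()

    def load_model(self):
        """Load the weights once; repeated calls are no-ops."""
        if self.is_loaded:
            return
        self.model = self._load_weights(self.name)
        self.is_loaded = True

    @property
    def config(self):
        # Before loading, fall back to the config-only object.
        return self.model["config"] if self.is_loaded else self.cfg_only
```

The point of the pattern is that properties derived from the configuration remain usable before any weights are read from disk.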

Related Pages

Principle:OpenGVLab_InternVL_Vision_Encoder_Abstraction
