Implementation:OpenGVLab InternVL CLIPVisionTower

From Leeroopedia
Revision as of 16:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenGVLab_InternVL_CLIPVisionTower.md)


Knowledge Sources
Domains: Vision Encoder, Multimodal Models, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

Unified vision tower wrapper that dispatches to CLIP, EVA-CLIP, InternViT-6B, or InternVL-14B vision encoders based on the model name, providing a consistent interface for the LLaVA architecture.

Description

CLIPVisionTower is an nn.Module that serves as the central vision encoder abstraction in the LLaVA pipeline. It detects the encoder type by inspecting the vision_tower_name string: names containing "EVA" or "eva" route to EvaCLIPVisionModel, names matching InternViT-6B patterns route to InternVisionModel, names matching InternVL-14B patterns route to InternVLModel, and all others default to the standard CLIPVisionModel.

Each encoder type gets the appropriate CLIPImageProcessor configuration: InternViT and InternVL models use ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), while CLIP and EVA-CLIP use their pretrained processor settings.

The feature_select() method extracts features from a configurable hidden layer (select_layer) and supports two modes: patch, which excludes the CLS token, and cls_patch, which keeps all tokens. The forward pass runs images through the frozen vision tower and returns the selected features; for InternVL-14B it additionally returns query outputs alongside the image features.

The module supports delay_load for lazy initialization, exposing config-only access via cfg_only. Properties expose hidden_size, num_patches (with InternVL-14B adding 96 query token patches), dtype, and device.
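
The name-based dispatch described above can be illustrated with a small, dependency-free sketch. The function name select_encoder_class is hypothetical, and the substring patterns used for InternViT-6B and InternVL-14B are assumptions made for illustration; the page does not state the exact patterns the real class checks.

```python
def select_encoder_class(vision_tower_name: str) -> str:
    """Return the name of the encoder class a tower name would route to.

    Illustrative only: the InternViT-6B / InternVL-14B substring patterns
    below are assumed, not taken from the actual implementation.
    """
    # EVA-CLIP: the description mentions names containing "EVA" or "eva".
    if "EVA" in vision_tower_name or "eva" in vision_tower_name:
        return "EvaCLIPVisionModel"

    lowered = vision_tower_name.lower()
    if "internvit-6b" in lowered:   # assumed pattern
        return "InternVisionModel"
    if "internvl-14b" in lowered:   # assumed pattern
        return "InternVLModel"

    # Everything else falls back to the standard CLIP vision model.
    return "CLIPVisionModel"


if __name__ == "__main__":
    for name in ("path/to/InternViT-6B-448px",
                 "openai/clip-vit-large-patch14-336"):
        print(name, "->", select_encoder_class(name))
```

One consequence of plain substring matching is that an unrelated path containing "eva" (for example a directory named "eval") would also route to the EVA-CLIP branch, so checkpoint paths are worth naming with care.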

Usage

Use this class as the vision backbone in any LLaVA model variant. It is instantiated by the build_vision_tower factory function and accessed via LlavaMetaModel.get_vision_tower().

Code Reference

Source Location

Signature

class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower, args, delay_load=False): ...
    def load_model(self): ...
    def feature_select(self, image_forward_outs): ...
    @torch.no_grad()
    def forward(self, images): ...
    @property
    def hidden_size(self): ...
    @property
    def num_patches(self): ...
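
To make the role of feature_select() concrete, the following is a minimal sketch of the selection logic using nested Python lists instead of tensors. It assumes the CLS token sits at index 0 of the token sequence (the usual ViT layout) and takes select_layer and select_feature as explicit arguments; the real method reads them from the module and operates on the encoder's hidden-state outputs.

```python
from typing import List, Sequence

# One image's tokens at one layer: [num_tokens][hidden_size]
Tokens = List[List[float]]


def feature_select(hidden_states: Sequence[Tokens],
                   select_layer: int,
                   select_feature: str) -> Tokens:
    """Pick one hidden layer, then keep patch tokens or all tokens."""
    tokens = hidden_states[select_layer]
    if select_feature == "patch":
        # Assumption: the CLS token is the first token, so drop index 0.
        return tokens[1:]
    if select_feature == "cls_patch":
        return tokens
    raise ValueError(f"Unexpected select feature: {select_feature}")


# Toy data: 3 layers, each with 1 CLS token + 4 patch tokens, hidden size 2.
hidden_states = [
    [[float(layer), float(tok)] for tok in range(5)]
    for layer in range(3)
]

patch_only = feature_select(hidden_states, -2, "patch")
with_cls = feature_select(hidden_states, -2, "cls_patch")
print(len(patch_only), len(with_cls))
```

A negative select_layer indexes from the end of the layer list, which is a convenient way to pick, say, the second-to-last layer without knowing the encoder depth.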

Import

from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower

I/O Contract

Inputs

Name | Type | Required | Description
images | torch.Tensor or List[torch.Tensor] | Yes | Image tensors of shape (B, C, H, W), or a list of (C, H, W) tensors
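
The image tensors passed in are expected to be preprocessed already. For the InternViT and InternVL backends, the Description gives ImageNet mean and std values; the per-channel arithmetic can be sketched as below. This is a plain-Python illustration, not the CLIPImageProcessor code path, and it assumes pixel values have already been rescaled to the range [0, 1].

```python
from typing import Sequence, Tuple

# Values quoted in the Description for InternViT / InternVL models.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Sequence[float],
                    mean: Sequence[float] = IMAGENET_MEAN,
                    std: Sequence[float] = IMAGENET_STD) -> Tuple[float, ...]:
    """Apply (value - mean) / std to each channel of one RGB pixel.

    Assumption: `rgb` holds floats in [0, 1], not raw 0-255 integers.
    """
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to (approximately) zero.
print(normalize_pixel(IMAGENET_MEAN))
```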

Outputs

Name | Type | Description
image_features | torch.Tensor or List | Extracted patch features from the selected hidden layer; for InternVL-14B returns [features, query_outputs]

Usage Examples

Basic Usage

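from llava.model.multimodal_encoder.clip_encoder import CLIPVisionTower

# `args` and `image_tensor` are assumed to be defined by the caller:
# `args` is the model-arguments object passed to the constructor, and
# `image_tensor` is a preprocessed image batch of shape (B, C, H, W).

# Create with InternViT-6B backend
tower = CLIPVisionTower("path/to/InternViT-6B-448px", args)
tower.load_model()

# Encode images
features = tower(image_tensor)  # shape: (B, num_patches, hidden_size)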
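
The num_patches dimension in the output shape can be estimated with the usual ViT patch-grid arithmetic. The sketch below is an assumption-laden illustration: the helper name expected_num_patches is hypothetical, the formula (image_size // patch_size) ** 2 is the standard ViT patch count rather than something this page states, and the image and patch sizes in the example are illustrative. Only the 96 extra query tokens for InternVL-14B come from the Description.

```python
def expected_num_patches(image_size: int,
                         patch_size: int,
                         num_query_tokens: int = 0) -> int:
    """Patch tokens on a square grid, plus any extra query tokens.

    Assumption: square images split into non-overlapping square patches,
    with the CLS token excluded from the count.
    """
    return (image_size // patch_size) ** 2 + num_query_tokens


# Illustrative values only: a 448px input with an assumed patch size of 14.
print(expected_num_patches(448, 14))

# InternVL-14B adds 96 query tokens per the Description; the 224px input
# and patch size of 14 here are assumptions.
print(expected_num_patches(224, 14, num_query_tokens=96))
```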
