
Implementation:Mit han lab Llm awq CLIPVisionTower

From Leeroopedia
Knowledge Sources
Domains Vision, Model_Architecture
Last Updated 2026-02-15 00:00 GMT

Overview

Wraps a pretrained CLIP vision model as a feature extraction tower for the LLaVA multimodal architecture, supporting configurable hidden layer selection and feature type.

Description

CLIPVisionTower is an nn.Module that encapsulates a HuggingFace CLIPVisionModel for use as the visual encoder in LLaVA-based models. It supports delayed loading (loading only the config initially and deferring full model weight loading until needed), which is useful during model initialization when the vision tower may not be immediately required.
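
The delayed-loading idea can be illustrated with a minimal, dependency-free sketch. `LazyTower`, its `is_loaded` flag, and the dictionary-based config are hypothetical stand-ins chosen for illustration; they are not the actual CLIPVisionTower code, which loads a CLIPVisionConfig and real CLIP weights.

```python
class LazyTower:
    """Hypothetical stand-in that mimics the delay_load pattern."""

    def __init__(self, name, delay_load=False):
        self.name = name
        self.is_loaded = False  # assumed flag name, for illustration only
        self.weights = None
        if delay_load:
            # Cheap metadata only; stands in for a config-only load.
            self.cfg_only = {"name": name}
        else:
            self.load_model()

    def load_model(self):
        if self.is_loaded:
            return  # loading twice is a no-op
        self.weights = [0.0] * 4  # placeholder for the expensive weight load
        self.is_loaded = True

    @property
    def config(self):
        # Fall back to the config-only metadata until weights are loaded.
        if self.is_loaded:
            return {"name": self.name, "loaded": True}
        return self.cfg_only


tower = LazyTower("toy-encoder", delay_load=True)
config_before = tower.config   # served from cfg_only
tower.load_model()
config_after = tower.config    # served from the loaded model
```

The point of the pattern is that construction stays cheap, while anything that only needs metadata (such as sizes read from the config) keeps working before the weights exist.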

The load_model method instantiates the CLIPImageProcessor and CLIPVisionModel from a pretrained checkpoint and freezes all vision tower parameters (requires_grad=False). The feature_select method extracts hidden states from a configurable layer (controlled by mm_vision_select_layer) and supports two feature selection modes: "patch" (excludes the CLS token, returning only patch features) and "cls_patch" (includes both CLS and patch tokens). The forward method handles both single batch tensors and lists of individual images, running each through the CLIP model with output_hidden_states=True, selecting the appropriate features, and casting back to the input dtype.
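
The two feature selection modes can be sketched on toy tensors as follows. The function name `select_features`, the default layer index `-2`, and the error raised for unknown modes are assumptions made for illustration; the sketch relies only on the fact that CLIP places the CLS token at sequence position 0.

```python
import torch


def select_features(hidden_states, select_layer=-2, select_feature="patch"):
    """Pick one layer's hidden states and optionally drop the CLS token.

    hidden_states: sequence of (B, 1 + num_patches, D) tensors, one per layer.
    select_layer=-2 is an assumed default, not taken from the source.
    """
    features = hidden_states[select_layer]
    if select_feature == "patch":
        return features[:, 1:]  # drop the CLS token at position 0
    if select_feature == "cls_patch":
        return features  # keep CLS followed by the patch tokens
    raise ValueError(f"Unexpected select feature: {select_feature}")


# Toy stand-in for CLIP outputs: 3 layers, batch of 2, 1 CLS + 4 patches, width 8.
hidden_states = tuple(torch.randn(2, 5, 8) for _ in range(3))
patch_only = select_features(hidden_states, select_feature="patch")
with_cls = select_features(hidden_states, select_feature="cls_patch")
```

In this toy setup `patch_only` has one token fewer along the sequence dimension than `with_cls`, which is the only difference between the two modes.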

Properties expose dummy_feature (a zero tensor for placeholder usage), dtype, device, config (with fallback to cfg_only when not loaded), hidden_size, and num_patches (computed from image_size and patch_size).
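
As a worked example of the num_patches computation, the sketch below assumes the usual ViT formula of a square grid, (image_size // patch_size) squared; the standalone function is illustrative and not the property's actual code.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of patches in a square grid (assumed formula)."""
    return (image_size // patch_size) ** 2


# openai/clip-vit-large-patch14-336: 336 px images, 14 px patches
print(num_patches(336, 14))  # 24 * 24 = 576
# openai/clip-vit-large-patch14: 224 px images, 14 px patches
print(num_patches(224, 14))  # 16 * 16 = 256
```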

Usage

Import CLIPVisionTower when building a LLaVA model that uses CLIP as its vision encoder. It is typically instantiated via the build_vision_tower factory function rather than directly.

Code Reference

Source Location

Signature

class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower: str, args, delay_load: bool = False): ...
    def load_model(self) -> None: ...
    def feature_select(self, image_forward_outs) -> torch.Tensor: ...
    def forward(self, images: Union[torch.Tensor, List[torch.Tensor]]) -> torch.Tensor: ...

    @property
    def dummy_feature(self) -> torch.Tensor: ...
    @property
    def dtype(self) -> torch.dtype: ...
    @property
    def device(self) -> torch.device: ...
    @property
    def config(self) -> CLIPVisionConfig: ...
    @property
    def hidden_size(self) -> int: ...
    @property
    def num_patches(self) -> int: ...

Import

from tinychat.models.llava_base.multimodal_encoder.clip_encoder import CLIPVisionTower

I/O Contract

Inputs

Name | Type | Required | Description
vision_tower | str | Yes | Name or path of the pretrained CLIP model (e.g., "openai/clip-vit-large-patch14")
args | object | Yes | Configuration object with mm_vision_select_layer and mm_vision_select_feature attributes
delay_load | bool | No | If True, defer weight loading; only load CLIPVisionConfig initially
images | torch.Tensor or List[torch.Tensor] | Yes (forward) | Image tensor(s) of shape (B, C, H, W) or list of (C, H, W) tensors
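
The two accepted input forms for images can be handled along the lines of the sketch below. `encode` and `toy_encoder` are hypothetical names, and returning a Python list of per-image features for list input is an assumption of this sketch rather than confirmed behavior of CLIPVisionTower.forward.

```python
import torch


def encode(images, encoder):
    """Run `encoder` on a (B, C, H, W) batch or on a list of (C, H, W) images."""
    if isinstance(images, list):
        # Give each image a temporary batch dimension of 1, then cast the
        # result back to that image's dtype.
        return [encoder(img.unsqueeze(0)).to(img.dtype) for img in images]
    return encoder(images).to(images.dtype)


def toy_encoder(x):
    # Hypothetical encoder: computes in float32 and reshapes every image
    # into 4 "tokens".
    return x.float().reshape(x.shape[0], 4, -1)


batch = torch.zeros(2, 3, 4, 4, dtype=torch.float16)
out_batch = encode(batch, toy_encoder)
out_list = encode([batch[0], batch[1]], toy_encoder)
```

Casting back to the input dtype, as described for the forward method, keeps the encoder's internal precision from leaking into the rest of the model.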

Outputs

Name | Type | Description
image_features | torch.Tensor | Extracted vision features of shape (B, N, D), where D is hidden_size and N is num_patches in "patch" mode (num_patches + 1 in "cls_patch" mode, because the CLS token is kept)

Usage Examples

Building and using the vision tower

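from types import SimpleNamespace

import torch
from tinychat.models.llava_base.multimodal_encoder.clip_encoder import CLIPVisionTower

# Illustrative settings (assumed values); args only needs these two attributes
config = SimpleNamespace(mm_vision_select_layer=-2, mm_vision_select_feature="patch")

tower = CLIPVisionTower(
    "openai/clip-vit-large-patch14-336",
    args=config,
    delay_load=True,  # load only the CLIPVisionConfig for now
)
tower.load_model()  # load the image processor and the frozen CLIP weights

# Encode a batch of images (random placeholder pixels; this checkpoint expects 336x336)
images = torch.randn(2, 3, 336, 336)
image_features = tower(images)  # (B, num_patches, hidden_size) in "patch" mode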
