Implementation:Mit han lab Llm awq CLIPVisionTower

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Vision, Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

Wraps a pretrained CLIP vision model as a feature extraction tower for the LLaVA multimodal architecture, supporting configurable hidden layer selection and feature type.

Description

CLIPVisionTower is an nn.Module that encapsulates a HuggingFace CLIPVisionModel for use as the visual encoder in LLaVA-based models. It supports delayed loading (loading only the config initially and deferring full model weight loading until needed), which is useful during model initialization when the vision tower may not be immediately required.
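The delayed-loading pattern described above can be sketched without any model weights. This is a minimal illustration rather than the repository's code: the LazyTower class, the is_loaded flag, and the two fake loader functions are hypothetical stand-ins, while cfg_only and load_model mirror names used on this page.

```python
from typing import Callable, Optional


class LazyTower:
    """Toy stand-in for a module that can defer loading its weights."""

    def __init__(
        self,
        name: str,
        load_config: Callable[[str], dict],
        load_weights: Callable[[str], dict],
        delay_load: bool = False,
    ) -> None:
        self.name = name
        self.is_loaded = False  # hypothetical flag name (assumption)
        self.model: Optional[dict] = None
        self.cfg_only: Optional[dict] = None
        self._load_weights = load_weights
        if delay_load:
            # Cheap path: fetch only the configuration.
            self.cfg_only = load_config(name)
        else:
            self.load_model()

    def load_model(self) -> None:
        # Idempotent: a second call does nothing.
        if self.is_loaded:
            return
        self.model = self._load_weights(self.name)
        self.is_loaded = True

    @property
    def config(self) -> dict:
        # Fall back to the config-only object until weights are loaded.
        return self.model["config"] if self.is_loaded else self.cfg_only


# Hypothetical loaders standing in for the real pretrained-checkpoint calls.
def fake_load_config(name: str) -> dict:
    return {"name": name, "hidden_size": 1024}


def fake_load_weights(name: str) -> dict:
    return {"config": fake_load_config(name), "weights": "..."}


tower = LazyTower("demo-clip", fake_load_config, fake_load_weights, delay_load=True)
config_before = tower.config      # served from cfg_only
loaded_before = tower.is_loaded   # still False at this point
tower.load_model()
loaded_after = tower.is_loaded    # now True
```

The point of the design is that properties such as config stay usable before the expensive load, so surrounding model code can size its layers without touching the weights.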

The load_model method instantiates the CLIPImageProcessor and CLIPVisionModel from a pretrained checkpoint and freezes all vision tower parameters (requires_grad=False). The feature_select method extracts hidden states from a configurable layer (controlled by mm_vision_select_layer) and supports two feature selection modes: "patch" (excludes the CLS token, returning only patch features) and "cls_patch" (includes both CLS and patch tokens). The forward method handles both single batch tensors and lists of individual images, running each through the CLIP model with output_hidden_states=True, selecting the appropriate features, and casting back to the input dtype.
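The two feature selection modes can be illustrated on stand-in tensors. The sketch below is not the repository's feature_select: the function name select_features, the default layer index of -2, and the random hidden states are assumptions chosen for illustration. It relies only on the fact that HuggingFace CLIP places the CLS token first in each hidden-state sequence.

```python
import torch


def select_features(hidden_states, select_layer=-2, select_feature="patch"):
    """Pick one layer's hidden states and optionally drop the CLS token.

    hidden_states: sequence of tensors shaped (B, 1 + num_patches, D),
    one per layer, as returned when output_hidden_states=True.
    """
    features = hidden_states[select_layer]
    if select_feature == "patch":
        return features[:, 1:]  # drop the leading CLS token
    if select_feature == "cls_patch":
        return features  # keep CLS followed by the patch tokens
    raise ValueError(f"Unexpected select feature: {select_feature}")


# Stand-in hidden states: 5 "layers", batch of 2, 576 patches, width 8.
B, num_patches, D = 2, 576, 8
hidden_states = tuple(torch.randn(B, 1 + num_patches, D) for _ in range(5))

patch_only = select_features(hidden_states, -2, "patch")
with_cls = select_features(hidden_states, -2, "cls_patch")
print(patch_only.shape)  # torch.Size([2, 576, 8])
print(with_cls.shape)    # torch.Size([2, 577, 8])
```

Note that the sequence length differs by one between the two modes, which matters for any downstream projector that assumes a fixed token count.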

Properties expose dummy_feature (a zero tensor for placeholder usage), dtype, device, config (with fallback to cfg_only when not loaded), hidden_size, and num_patches (computed from image_size and patch_size).
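As a worked example of the num_patches computation, the sketch below assumes square inputs split into non-overlapping square patches, so the count is the square of image_size divided by patch_size. The helper name num_patches_for is hypothetical, and the exact expression in the source file may differ.

```python
def num_patches_for(image_size: int, patch_size: int) -> int:
    """Number of patches for a square image and square, non-overlapping patches."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# A 336-pixel input with 14-pixel patches gives a 24 x 24 grid.
print(num_patches_for(336, 14))  # 576
# A 224-pixel input with 14-pixel patches gives a 16 x 16 grid.
print(num_patches_for(224, 14))  # 256
```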

Usage

Import CLIPVisionTower when building a LLaVA model that uses CLIP as its vision encoder. It is typically instantiated through the build_vision_tower factory function rather than directly.

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/models/llava_base/multimodal_encoder/clip_encoder.py
Lines: 1-97

Signature

class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower: str, args, delay_load: bool = False): ...
    def load_model(self) -> None: ...
    def feature_select(self, image_forward_outs) -> torch.Tensor: ...
    def forward(self, images: Union[torch.Tensor, List[torch.Tensor]]) -> torch.Tensor: ...

    @property
    def dummy_feature(self) -> torch.Tensor: ...
    @property
    def dtype(self) -> torch.dtype: ...
    @property
    def device(self) -> torch.device: ...
    @property
    def config(self) -> CLIPVisionConfig: ...
    @property
    def hidden_size(self) -> int: ...
    @property
    def num_patches(self) -> int: ...

Import

from tinychat.models.llava_base.multimodal_encoder.clip_encoder import CLIPVisionTower

I/O Contract

Inputs

Name	Type	Required	Description
vision_tower	str	Yes	Name or path of the pretrained CLIP model (e.g., "openai/clip-vit-large-patch14")
args	object	Yes	Configuration object with mm_vision_select_layer and mm_vision_select_feature attributes
delay_load	bool	No	If True, defer weight loading; only load CLIPVisionConfig initially
images	torch.Tensor or List[torch.Tensor]	Yes (forward)	Image tensor(s) of shape (B, C, H, W) or list of (C, H, W) tensors

Outputs

Name	Type	Description
image_features	torch.Tensor	Extracted vision features of shape (B, N, D), where D is hidden_size and N is num_patches in "patch" mode or num_patches + 1 in "cls_patch" mode (the CLS token is kept)

Usage Examples

Building and using the vision tower

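from types import SimpleNamespace

import torch

from tinychat.models.llava_base.multimodal_encoder.clip_encoder import CLIPVisionTower

# Illustrative settings; any object exposing these two attributes works.
config = SimpleNamespace(mm_vision_select_layer=-2, mm_vision_select_feature="patch")

tower = CLIPVisionTower(
    "openai/clip-vit-large-patch14-336",
    args=config,
    delay_load=True,  # load only the config for now
)
tower.load_model()  # load the image processor and the frozen CLIP weights

# Encode a batch of preprocessed images (placeholder tensor at 336x336)
images = torch.zeros(2, 3, 336, 336)
image_features = tower(images)  # (B, num_patches, hidden_size) in "patch" mode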

Related Pages

Principle:Mit_han_lab_Llm_awq_Vision_Transformer_Encoding
