Implementation:Mit han lab Llm awq CLIPVisionTower
| Knowledge Sources | |
|---|---|
| Domains | Vision, Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Wraps a pretrained CLIP vision model as a feature extraction tower for the LLaVA multimodal architecture, supporting configurable hidden layer selection and feature type.
Description
CLIPVisionTower is an nn.Module that encapsulates a HuggingFace CLIPVisionModel for use as the visual encoder in LLaVA-based models. It supports delayed loading (loading only the config initially and deferring full model weight loading until needed), which is useful during model initialization when the vision tower may not be immediately required.
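The delayed-loading pattern described above can be sketched in plain Python. This is a minimal illustration, not the class's actual code: `LazyTower`, the `is_loaded` flag, and the two stub loader functions are hypothetical stand-ins for the real config and weight loading.

```python
from typing import Callable, Optional


class LazyTower:
    """Hypothetical stand-in that illustrates delayed loading only."""

    def __init__(
        self,
        name: str,
        load_config: Callable[[str], dict],
        load_weights: Callable[[str], dict],
        delay_load: bool = False,
    ):
        self.name = name
        self.is_loaded = False
        self.model: Optional[dict] = None
        self._load_weights = load_weights
        if delay_load:
            # Cheap path: keep only the config until weights are needed.
            self.cfg_only = load_config(name)
        else:
            self.load_model()

    def load_model(self) -> None:
        # Expensive path: materialize the weights at most once.
        if self.is_loaded:
            return
        self.model = self._load_weights(self.name)
        self.is_loaded = True

    @property
    def config(self) -> dict:
        # Fall back to the config-only object until the model is loaded.
        return self.model["config"] if self.is_loaded else self.cfg_only


def fake_config(name: str) -> dict:
    # Stub: stands in for reading a config file.
    return {"hidden_size": 1024}


def fake_weights(name: str) -> dict:
    # Stub: stands in for loading a full checkpoint.
    return {"config": {"hidden_size": 1024}, "weights": "..."}


tower = LazyTower("some/checkpoint", fake_config, fake_weights, delay_load=True)
loaded_before = tower.is_loaded            # weights not loaded yet
hidden_before = tower.config["hidden_size"]  # served from cfg_only
tower.load_model()
loaded_after = tower.is_loaded
```

The point of the design is that properties such as `config` remain usable before any weights are in memory.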
The load_model method instantiates the CLIPImageProcessor and CLIPVisionModel from a pretrained checkpoint and freezes all vision tower parameters (requires_grad=False). The feature_select method extracts hidden states from a configurable layer (controlled by mm_vision_select_layer) and supports two feature selection modes (controlled by mm_vision_select_feature): "patch" (excludes the CLS token, returning only patch features) and "cls_patch" (includes both CLS and patch tokens). The forward method accepts either a single batched tensor or a list of individual images. In both cases it runs the input through the CLIP model with output_hidden_states=True, selects the appropriate features, and casts them back to the input dtype.
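The selection step can be sketched as a standalone function. This is an illustration under stated assumptions rather than the method itself: `select_features` is a hypothetical name, the default layer index of -2 is only an example value, and the random tensors stand in for the `hidden_states` tuple that a CLIP vision model returns when called with output_hidden_states=True (CLS token first along the sequence dimension).

```python
import torch


def select_features(hidden_states, select_layer: int = -2, select_feature: str = "patch"):
    """Pick one hidden layer, then optionally drop the leading CLS token."""
    features = hidden_states[select_layer]  # (B, 1 + num_patches, D)
    if select_feature == "patch":
        return features[:, 1:]  # drop the CLS token at position 0
    if select_feature == "cls_patch":
        return features  # keep CLS followed by the patch tokens
    raise ValueError(f"Unexpected select feature: {select_feature}")


# Fake hidden states: 3 layers, batch of 2, 1 CLS + 4 patch tokens, width 8.
hidden_states = tuple(torch.randn(2, 5, 8) for _ in range(3))
patch_only = select_features(hidden_states, select_layer=-2, select_feature="patch")
with_cls = select_features(hidden_states, select_layer=-2, select_feature="cls_patch")
```

In "patch" mode the sequence length is one shorter than in "cls_patch" mode, which matters to any downstream projector that expects a fixed token count.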
Properties expose dummy_feature (a zero tensor for placeholder usage), dtype, device, config (with fallback to cfg_only when not loaded), hidden_size, and num_patches (computed from image_size and patch_size).
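As a worked example of the num_patches computation, the sketch below assumes square images split into non-overlapping square patches, so the count is the number of patches per side, squared. The function name is hypothetical and the exact expression in the source file may differ. The values 336 and 14 correspond to the clip-vit-large-patch14-336 checkpoint.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side ** 2


n_336 = num_patches(336, 14)  # 24 patches per side
n_224 = num_patches(224, 14)  # 16 patches per side
```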
Usage
Import CLIPVisionTower when building a LLaVA model that uses CLIP as its vision encoder. It is typically instantiated via the build_vision_tower factory function rather than directly.
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/models/llava_base/multimodal_encoder/clip_encoder.py
- Lines: 1-97
Signature
class CLIPVisionTower(nn.Module):
    def __init__(self, vision_tower: str, args, delay_load: bool = False): ...
    def load_model(self) -> None: ...
    def feature_select(self, image_forward_outs) -> torch.Tensor: ...
    def forward(self, images: Union[torch.Tensor, List[torch.Tensor]]) -> torch.Tensor: ...

    @property
    def dummy_feature(self) -> torch.Tensor: ...
    @property
    def dtype(self) -> torch.dtype: ...
    @property
    def device(self) -> torch.device: ...
    @property
    def config(self) -> CLIPVisionConfig: ...
    @property
    def hidden_size(self) -> int: ...
    @property
    def num_patches(self) -> int: ...
Import
from tinychat.models.llava_base.multimodal_encoder.clip_encoder import CLIPVisionTower
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vision_tower | str | Yes | Name or path of the pretrained CLIP model (e.g., "openai/clip-vit-large-patch14") |
| args | object | Yes | Configuration object with mm_vision_select_layer and mm_vision_select_feature attributes |
| delay_load | bool | No | If True, defer weight loading; only load CLIPVisionConfig initially |
| images | torch.Tensor or List[torch.Tensor] | Yes (forward) | Image tensor(s) of shape (B, C, H, W) or list of (C, H, W) tensors |
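The args object only needs to expose the two attributes listed above, so a plain namespace is enough for experimentation. The sketch below is an assumption-laden illustration: the values -2 and "patch" are example settings, not confirmed defaults, and `check_tower_args` is a hypothetical helper that is not part of the library.

```python
from types import SimpleNamespace

# Example values only; choose the layer and mode your model was trained with.
args = SimpleNamespace(
    mm_vision_select_layer=-2,
    mm_vision_select_feature="patch",
)


def check_tower_args(args) -> bool:
    """Hypothetical helper: verify the attributes the tower reads are present and sane."""
    layer = getattr(args, "mm_vision_select_layer", None)
    mode = getattr(args, "mm_vision_select_feature", None)
    return isinstance(layer, int) and mode in ("patch", "cls_patch")


args_ok = check_tower_args(args)
bad_ok = check_tower_args(SimpleNamespace(mm_vision_select_layer=-2))
```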
Outputs
| Name | Type | Description |
|---|---|---|
| image_features | torch.Tensor | Extracted vision features of shape (B, N, D), where D is hidden_size and N is num_patches in "patch" mode or num_patches + 1 in "cls_patch" mode (the extra token is CLS) |
Usage Examples
Building and using the vision tower
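from tinychat.models.llava_base.multimodal_encoder.clip_encoder import CLIPVisionTower

# `config` must provide mm_vision_select_layer and mm_vision_select_feature.
tower = CLIPVisionTower(
    "openai/clip-vit-large-patch14-336",
    args=config,
    delay_load=True,
)
tower.load_model()  # load the deferred image processor and weights

# Encode a batch of preprocessed images of shape (B, C, H, W)
image_features = tower(images)  # (B, num_patches, hidden_size) in "patch" mode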