Implementation:Mlc ai Mlc llm CLIP Vision
Overview
The CLIP Vision module implements the CLIP (Contrastive Language-Image Pre-training) Vision Encoder as a set of TVM Relax neural network modules. It is located at python/mlc_llm/model/vision/clip_vision.py (229 lines).
This module provides a complete implementation of the CLIP vision transformer pipeline: from image patch embedding through multi-head self-attention encoder layers to the final output. It is used as the visual backbone in multimodal LLM architectures that require image understanding capabilities.
Source File
- File: python/mlc_llm/model/vision/clip_vision.py
- Lines: 229
- Module: mlc_llm.model.vision.clip_vision
Dependencies
| Import | Purpose |
|---|---|
| dataclasses | Used to define CLIPVisionConfig as a dataclass |
| tvm.relax | TVM Relax framework for neural network operations |
| tvm.relax.frontend.nn | Neural network module base classes (Module, Tensor) |
| tvm.relax.frontend.nn.modules.Conv2D | 2D convolution for patch embedding |
| tvm.relax.frontend.nn.op | Tensor operations: add, broadcast_to, concat, permute_dims, reshape, wrap_nested |
| tvm.relax.op.arange | Generates position ID sequences |
| mlc_llm.op | Extended operations including the attention function |
| mlc_llm.support.config.ConfigBase | Base class for model configurations |
Class: CLIPVisionConfig
@dataclasses.dataclass
class CLIPVisionConfig(ConfigBase):
    hidden_size: int
    image_size: int
    intermediate_size: int
    num_attention_heads: int
    num_hidden_layers: int
    patch_size: int
    projection_dim: int
    vocab_size: int
    num_channels: int = 3
    layer_norm_eps: float = 1e-06
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)
This configuration dataclass holds all hyperparameters for the CLIP vision encoder. The patch_size determines how the input image is divided into patches, while hidden_size controls the embedding dimensionality throughout the transformer layers.
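As a rough illustration of how such a configuration might be populated, the sketch below uses a stand-in dataclass, VisionConfigSketch, rather than the real CLIPVisionConfig, so it runs without TVM or MLC LLM installed. The from_dict helper is hypothetical and only assumes that unrecognized keys are collected in kwargs, which is what the kwargs field suggests. The hyperparameter values are illustrative (they resemble a ViT-L/14 setup at 336-pixel resolution), and some_unused_key is made up.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class VisionConfigSketch:
    """Stand-in mirroring the fields of CLIPVisionConfig (no ConfigBase)."""

    hidden_size: int
    image_size: int
    intermediate_size: int
    num_attention_heads: int
    num_hidden_layers: int
    patch_size: int
    projection_dim: int
    vocab_size: int
    num_channels: int = 3
    layer_norm_eps: float = 1e-06
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)


def from_dict(source: Dict[str, Any]) -> VisionConfigSketch:
    """Hypothetical helper: known keys become fields, the rest go to kwargs."""
    names = {f.name for f in dataclasses.fields(VisionConfigSketch)} - {"kwargs"}
    known = {k: v for k, v in source.items() if k in names}
    extra = {k: v for k, v in source.items() if k not in names}
    return VisionConfigSketch(**known, kwargs=extra)


config = from_dict(
    {
        "hidden_size": 1024,
        "image_size": 336,
        "intermediate_size": 4096,
        "num_attention_heads": 16,
        "num_hidden_layers": 24,
        "patch_size": 14,
        "projection_dim": 768,
        "vocab_size": 32000,
        "some_unused_key": "ignored",  # made-up key, ends up in kwargs
    }
)
print(config.num_channels, config.kwargs)
```

Fields with defaults (num_channels, layer_norm_eps) can be omitted from the source dictionary, while the eight fields without defaults must be present.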
Class: CLIPVisionEmbeddings
Converts raw pixel values into patch embeddings with positional information.
Architecture
- A Conv2D layer with kernel size and stride equal to patch_size extracts non-overlapping patch features from the input image.
- A learnable class_embedding parameter (CLS token) is prepended to the sequence of patch embeddings.
- Learnable position_embedding values are added to encode spatial position information.
def forward(self, pixel_values: Tensor) -> Tensor:
    batch_size = pixel_values.shape[0]
    patch_embeds = self.patch_embedding(pixel_values)  # [*, width, grid, grid]
    patch_embeds = reshape(patch_embeds, shape=(batch_size, self.embed_dim, -1))
    patch_embeds = permute_dims(patch_embeds, axes=(0, 2, 1))  # [batch, grid*grid, embed_dim]
    class_embeds = broadcast_to(
        self.class_embedding, shape=(batch_size, 1, self.embed_dim)
    )
    embeddings = concat([class_embeds, patch_embeds], dim=1)
    # ... add positional embeddings
    return embeddings
The number of patches is computed as (image_size // patch_size) ** 2, and the total number of positions is num_patches + 1 (including the CLS token).
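To make the shape bookkeeping concrete, here is a small sketch that traces the tensor shape through each step of the embedding forward pass using plain integer arithmetic. The function name embedding_shapes and the example values (a 336-pixel image, 14-pixel patches, an embedding width of 1024) are assumptions chosen for illustration and are not taken from the source file.

```python
from typing import Tuple


def embedding_shapes(
    batch: int, image_size: int, patch_size: int, embed_dim: int
) -> Tuple[Tuple[int, ...], ...]:
    """Return the tensor shape after each step of the embedding forward pass."""
    grid = image_size // patch_size  # patches per side
    num_patches = grid * grid  # (image_size // patch_size) ** 2
    after_conv = (batch, embed_dim, grid, grid)
    after_reshape = (batch, embed_dim, num_patches)
    after_permute = (batch, num_patches, embed_dim)
    after_concat = (batch, num_patches + 1, embed_dim)  # CLS token prepended
    return after_conv, after_reshape, after_permute, after_concat


shapes = embedding_shapes(batch=1, image_size=336, patch_size=14, embed_dim=1024)
for step, shape in zip(("conv", "reshape", "permute", "concat"), shapes):
    print(f"{step:8s} {shape}")
```

With these example values the grid is 24 by 24, giving 576 patches and 577 positions once the CLS token is included.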
Function: sigmoid
def sigmoid(x: Tensor, name: str = "sigmoid") -> Tensor:
    return wrap_nested(relax.op.sigmoid(x._expr), name)
A utility function that wraps the TVM Relax sigmoid operation, used by the QuickGELU activation.
Class: QuickGELU
class QuickGELU(Module):
    def forward(self, input_tensor: Tensor) -> Tensor:
        return input_tensor * sigmoid(input_tensor * 1.702)
Implements the QuickGELU activation function, which approximates GELU using the formula x * sigmoid(1.702 * x). This is the activation function used in the original CLIP model.
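To get a feel for how close the approximation is, the following sketch compares QuickGELU with the exact GELU (x times the standard normal CDF) on scalar inputs. It uses only the Python standard library and is independent of the TVM implementation. The function names and the sampled range of -5 to 5 are choices made for this example.

```python
import math


def quick_gelu(x: float) -> float:
    """QuickGELU: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Sample the interval [-5, 5] in steps of 0.1 and record the largest gap.
points = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(quick_gelu(x) - gelu(x)) for x in points)
print(f"largest gap on [-5, 5]: {max_gap:.4f}")
```

On this range the two functions stay within a few hundredths of each other, which is why the cheaper sigmoid form is an acceptable substitute.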
Class: CLIPMLP
A two-layer feed-forward network with QuickGELU activation:
class CLIPMLP(Module):
    def __init__(self, config: CLIPVisionConfig):
        self.activation_fn = QuickGELU()
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, hidden_states: Tensor) -> Tensor:
        hidden_states = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)
        hidden_states = self.fc2(hidden_states)
        return hidden_states
Class: CLIPAttention
Implements multi-head self-attention for the CLIP vision encoder.
Key Details
- Separate linear projections for Q, K, V (not fused).
- Validates that embed_dim is divisible by num_heads.
- Uses op_ext.attention (from mlc_llm.op) for the attention computation, passing None as the causal mask (since vision transformers use full bidirectional attention).
def forward(self, hidden_states: Tensor) -> Tensor:
    d, h = self.head_dim, self.num_heads
    b, s, _ = hidden_states.shape
    q = self.q_proj(hidden_states).reshape(b, s, h, d)
    k = self.k_proj(hidden_states).reshape(b, s, h, d)
    v = self.v_proj(hidden_states).reshape(b, s, h, d)
    attn_output = op_ext.attention(q, k, v, None)
    attn_output = self.out_proj(attn_output)
    return attn_output
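The sketch below illustrates two of the points above in plain Python: the divisibility check that yields the per-head width, and what unmasked (bidirectional) attention means for a single head. It implements standard scaled dot-product attention on nested lists and does not reproduce the internals of op_ext.attention. All function names and the toy inputs are assumptions made for this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def split_head_dim(embed_dim: int, num_heads: int) -> int:
    """Per-head width; mirrors the divisibility check described above."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return embed_dim // num_heads


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q: Matrix, k: Matrix) -> Matrix:
    """Unmasked weights for one head: softmax(q k^T / sqrt(d)), row by row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def attend(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Weighted sum of value rows using the unmasked weights."""
    weights = attention_weights(q, k)
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]


head_dim = split_head_dim(embed_dim=1024, num_heads=16)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, toy head width of 2
weights = attention_weights(tokens, tokens)
output = attend(tokens, tokens, tokens)
print(head_dim, weights[0])
```

Because no mask is applied, the first token assigns non-zero weight to the tokens that follow it; under a causal mask those entries would be zero.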
Class: CLIPEncoderLayer
A single transformer encoder layer with pre-norm architecture:
- Apply layer_norm1, then self-attention, then residual connection.
- Apply layer_norm2, then MLP, then residual connection.
def forward(self, hidden_states: Tensor) -> Tensor:
    residual = hidden_states
    hidden_states = self.layer_norm1(hidden_states)
    hidden_states = self.self_attn(hidden_states=hidden_states)
    hidden_states = residual + hidden_states
    residual = hidden_states
    hidden_states = self.layer_norm2(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states
    outputs = (hidden_states,)
    return outputs
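The pre-norm ordering can be isolated from the tensor machinery with a small sketch in which the normalization, attention, and MLP steps are passed in as ordinary callables on lists of floats. The helper pre_norm_block and the toy identity and zeros functions are hypothetical stand-ins used only to show where the residual additions happen.

```python
from typing import Callable, List

Vector = List[float]
Fn = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, standing in for the residual addition."""
    return [x + y for x, y in zip(a, b)]


def pre_norm_block(x: Vector, norm1: Fn, attn: Fn, norm2: Fn, mlp: Fn) -> Vector:
    """Pre-norm ordering: normalize, transform, then add the untouched input."""
    x = add(x, attn(norm1(x)))  # first residual branch
    x = add(x, mlp(norm2(x)))  # second residual branch
    return x


identity: Fn = lambda v: list(v)
zeros: Fn = lambda v: [0.0 for _ in v]

x = [1.0, -2.0, 3.0]
passthrough = pre_norm_block(x, identity, zeros, identity, zeros)
doubled_twice = pre_norm_block(x, identity, identity, identity, identity)
print(passthrough, doubled_twice)
```

When both sub-layers output zeros the block returns its input unchanged, which shows that the residual path bypasses the normalization entirely in a pre-norm design.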
Class: CLIPEncoder
Stacks multiple CLIPEncoderLayer modules and returns all intermediate hidden states (including the initial input embeddings):
def forward(self, inputs_embeds: Tensor) -> Tensor:
    hidden_states = inputs_embeds
    encoder_states: Tuple[Any, ...] = ()
    for _, encoder_layer in enumerate(self.layers):
        encoder_states = encoder_states + (hidden_states,)
        layer_outputs = encoder_layer(hidden_states)
        hidden_states = layer_outputs[0]
    encoder_states = encoder_states + (hidden_states,)
    return encoder_states
This returns a tuple of length num_hidden_layers + 1, containing all hidden states. This allows downstream modules to select features from any layer.
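The accumulation pattern can be checked with a toy version in which each "layer" is a plain function on a float. The helper collect_states and the add-one layers are assumptions for illustration; only the loop structure follows the forward method shown above.

```python
from typing import Callable, Sequence, Tuple


def collect_states(
    x: float, layers: Sequence[Callable[[float], float]]
) -> Tuple[float, ...]:
    """Record the input of every layer plus the final output."""
    states: Tuple[float, ...] = ()
    for layer in layers:
        states = states + (x,)  # state entering this layer
        x = layer(x)
    return states + (x,)  # append the output of the last layer


num_hidden_layers = 4
layers = [lambda v: v + 1.0 for _ in range(num_hidden_layers)]
states = collect_states(0.0, layers)
print(len(states), states[0], states[-1], states[-2])
```

Index 0 holds the input embeddings, index -1 the output of the last layer, and index -2 the output of the layer before it.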
Class: CLIPVisionTransformer
Combines embeddings, pre-layer-norm, encoder, and post-layer-norm into the full vision transformer pipeline:
class CLIPVisionTransformer(Module):
    def __init__(self, config: CLIPVisionConfig):
        embed_dim = config.hidden_size  # embedding width used by both layer norms
        self.embeddings = CLIPVisionEmbeddings(config)
        self.pre_layrnorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
        self.encoder = CLIPEncoder(config)
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
Class: CLIPVisionModel
The top-level model class that wraps CLIPVisionTransformer. It returns the second-to-last hidden state from the encoder (index [-2]), which is a common practice for extracting visual features in CLIP-based vision-language models.
class CLIPVisionModel(Module):
    no_quantization: bool = True

    def forward(self, pixel_values: Tensor) -> Tensor:
        return self.vision_model(pixel_values)[-2]
The no_quantization = True class attribute signals that this module should not be quantized, preserving full-precision visual features.
Module Hierarchy
CLIPVisionModel
  CLIPVisionTransformer
    CLIPVisionEmbeddings
      Conv2D (patch_embedding)
      Parameter (class_embedding)
      Embedding (position_embedding)
    LayerNorm (pre_layrnorm)
    CLIPEncoder
      CLIPEncoderLayer (x num_hidden_layers)
        LayerNorm (layer_norm1)
        CLIPAttention
          Linear (q_proj, k_proj, v_proj, out_proj)
        LayerNorm (layer_norm2)
        CLIPMLP
          Linear (fc1, fc2)
          QuickGELU
    LayerNorm (post_layernorm)
Categories
- Vision Encoder
- CLIP Architecture
- Multimodal
- Transformer
- TVM Relax Module