Implementation:Mlc ai Mlc llm CLIP Vision

Overview

The CLIP Vision module implements the CLIP (Contrastive Language-Image Pre-training) Vision Encoder as a set of TVM Relax neural network modules. It is located at python/mlc_llm/model/vision/clip_vision.py (229 lines).

This module provides a complete implementation of the CLIP vision transformer pipeline: from image patch embedding through multi-head self-attention encoder layers to the final output. It is used as the visual backbone in multimodal LLM architectures that require image understanding capabilities.

Source File

File: python/mlc_llm/model/vision/clip_vision.py
Lines: 229
Module: mlc_llm.model.vision.clip_vision

Dependencies

Import	Purpose
`dataclasses`	Used to define `CLIPVisionConfig` as a dataclass
`tvm.relax`	TVM Relax framework for neural network operations
`tvm.relax.frontend.nn`	Neural network module base classes (`Module`, `Tensor`)
`tvm.relax.frontend.nn.modules.Conv2D`	2D convolution for patch embedding
`tvm.relax.frontend.nn.op`	Tensor operations: `add`, `broadcast_to`, `concat`, `permute_dims`, `reshape`, `wrap_nested`
`tvm.relax.op.arange`	Generates position ID sequences
`mlc_llm.op`	Extended operations including the `attention` function
`mlc_llm.support.config.ConfigBase`	Base class for model configurations

Class: CLIPVisionConfig

@dataclasses.dataclass
class CLIPVisionConfig(ConfigBase):
    hidden_size: int
    image_size: int
    intermediate_size: int
    num_attention_heads: int
    num_hidden_layers: int
    patch_size: int
    projection_dim: int
    vocab_size: int
    num_channels: int = 3
    layer_norm_eps: float = 1e-06
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

This configuration dataclass holds all hyperparameters for the CLIP vision encoder. The patch_size determines how the input image is divided into patches, while hidden_size controls the embedding dimensionality throughout the transformer layers.

Class: CLIPVisionEmbeddings

Converts raw pixel values into patch embeddings with positional information.

Architecture

A Conv2D layer with kernel size and stride equal to patch_size extracts non-overlapping patch features from the input image.
A learnable class_embedding parameter (CLS token) is prepended to the sequence of patch embeddings.
Learnable position_embedding values are added to encode spatial position information.

def forward(self, pixel_values: Tensor) -> Tensor:
    batch_size = pixel_values.shape[0]
    patch_embeds = self.patch_embedding(pixel_values)       # [batch, embed_dim, grid, grid]
    patch_embeds = reshape(patch_embeds, shape=(batch_size, self.embed_dim, -1))
    patch_embeds = permute_dims(patch_embeds, axes=(0, 2, 1))  # [batch, grid*grid, embed_dim]
    class_embeds = broadcast_to(
        self.class_embedding, shape=(batch_size, 1, self.embed_dim)
    )
    embeddings = concat([class_embeds, patch_embeds], dim=1)
    # ... add positional embeddings
    return embeddings

The number of patches is computed as (image_size // patch_size) ** 2, and the total number of positions is num_patches + 1 (including the CLS token).
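
As a quick check of this arithmetic, the sketch below traces the tensor shapes through the embedding steps shown in forward. The helper name embedding_shapes and the example sizes (a 336-pixel image, 14-pixel patches, hidden size 1024) are illustrative assumptions and are not taken from the source file.

```python
def embedding_shapes(batch, image_size, patch_size, hidden_size, num_channels=3):
    """Trace shapes through the patch-embedding steps (hypothetical helper)."""
    grid = image_size // patch_size      # patches per side
    num_patches = grid ** 2              # non-overlapping patches per image
    num_positions = num_patches + 1      # +1 for the CLS token
    return {
        "pixel_values": (batch, num_channels, image_size, image_size),
        "after_conv": (batch, hidden_size, grid, grid),
        "after_reshape": (batch, hidden_size, num_patches),
        "after_permute": (batch, num_patches, hidden_size),
        "after_concat": (batch, num_positions, hidden_size),
    }


# Example sizes are assumptions chosen for illustration only.
shapes = embedding_shapes(batch=1, image_size=336, patch_size=14, hidden_size=1024)
for name, shape in shapes.items():
    print(f"{name:>14}: {shape}")
```

With these example sizes the grid is 24 x 24, giving 576 patches and 577 positions once the CLS token is prepended.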

Function: sigmoid

def sigmoid(x: Tensor, name: str = "sigmoid") -> Tensor:
    return wrap_nested(relax.op.sigmoid(x._expr), name)

A utility function that wraps the TVM Relax sigmoid operation, used by the QuickGELU activation.

Class: QuickGELU

class QuickGELU(Module):
    def forward(self, input_tensor: Tensor) -> Tensor:
        return input_tensor * sigmoid(input_tensor * 1.702)

Implements the QuickGELU activation function, which approximates GELU using the formula x * sigmoid(1.702 * x). This is the activation function used in the original CLIP model.
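
To give a feel for how close the approximation is, the following standalone sketch compares x * sigmoid(1.702 * x) with the exact GELU (x times the standard normal CDF) on plain Python floats. It uses only the math module and does not touch TVM; the function names and the evaluation grid are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def quick_gelu(x: float) -> float:
    """QuickGELU as described above: x * sigmoid(1.702 * x)."""
    return x * sigmoid(1.702 * x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest absolute gap between the two on a grid over [-6, 6].
xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(quick_gelu(x) - gelu(x)) for x in xs)
print(f"max |QuickGELU - GELU| on [-6, 6]: {max_gap:.4f}")
```

On this grid the two functions stay within a few hundredths of each other, which is why QuickGELU can stand in for GELU at lower cost. Weights trained with one activation should still be run with that same activation.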

Class: CLIPMLP

A two-layer feed-forward network with QuickGELU activation:

class CLIPMLP(Module):
    def __init__(self, config: CLIPVisionConfig):
        self.activation_fn = QuickGELU()
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, hidden_states: Tensor) -> Tensor:
        hidden_states = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)
        hidden_states = self.fc2(hidden_states)
        return hidden_states
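
The expand-then-project pattern can be sketched on plain Python lists for a single token. This is a toy illustration rather than the TVM module: the helper names, the weight values, and the use of bias terms are all assumptions.

```python
import math


def quick_gelu(x):
    return x / (1.0 + math.exp(-1.702 * x))   # same as x * sigmoid(1.702 * x)


def linear(x, weight, bias):
    """y = W x + b for one token; weight is [out_features][in_features]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def clip_mlp(x, w1, b1, w2, b2):
    h = linear(x, w1, b1)              # hidden_size -> intermediate_size
    h = [quick_gelu(v) for v in h]     # element-wise activation
    return linear(h, w2, b2)           # intermediate_size -> hidden_size


# Toy sizes (assumed): hidden_size=2, intermediate_size=4.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
b1 = [0.0] * 4
w2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b2 = [0.0, 0.0]

out = clip_mlp([1.0, 2.0], w1, b1, w2, b2)
print(len(out), out)
```

The output has the same width as the input, which is what lets the encoder layer add it back to the residual stream.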

Class: CLIPAttention

Implements multi-head self-attention for the CLIP vision encoder.

Key Details

Separate linear projections for Q, K, V (not fused).
Validates that embed_dim is divisible by num_heads.
Uses op_ext.attention (from mlc_llm.op) for the attention computation and passes None for the mask argument, so no causal mask is applied: vision transformers use full bidirectional attention, in which every patch can attend to every other patch.

def forward(self, hidden_states: Tensor) -> Tensor:
    d, h = self.head_dim, self.num_heads
    b, s, _ = hidden_states.shape
    q = self.q_proj(hidden_states).reshape(b, s, h, d)
    k = self.k_proj(hidden_states).reshape(b, s, h, d)
    v = self.v_proj(hidden_states).reshape(b, s, h, d)
    attn_output = op_ext.attention(q, k, v, None)
    attn_output = self.out_proj(attn_output)
    return attn_output
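
The internals of op_ext.attention are not shown on this page. As a rough mental model only, the sketch below implements generic unmasked scaled dot-product attention for one batch element in pure Python, using the same [seq, heads, head_dim] layout as the reshaped q, k, and v. The 1/sqrt(head_dim) scaling and the concatenation of heads along the last axis are assumptions, and the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Unmasked attention weights for one head. q, k: [seq][head_dim]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def multi_head_attention(q, k, v):
    """q, k, v: [seq][heads][head_dim] -> output [seq][heads * head_dim]."""
    seq, heads = len(q), len(q[0])
    out = [[] for _ in range(seq)]
    for h in range(heads):
        qh = [q[i][h] for i in range(seq)]
        kh = [k[i][h] for i in range(seq)]
        vh = [v[i][h] for i in range(seq)]
        weights = attention_weights(qh, kh)
        for i in range(seq):
            # Weighted sum over ALL positions j: there is no causal mask.
            mixed = [
                sum(weights[i][j] * vh[j][c] for j in range(seq))
                for c in range(len(vh[0]))
            ]
            out[i].extend(mixed)       # concatenate heads along the last axis
    return out


# Toy input (assumed): 3 tokens, 2 heads, head_dim 2.
x = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.5, 0.5]],
]
out = multi_head_attention(x, x, x)
w = attention_weights([t[0] for t in x], [t[0] for t in x])
print(len(out), len(out[0]))
print(w[0])
```

Every entry of the weight matrix is positive, including the weights the first token places on later tokens. A causal mask would force those entries to zero.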

Class: CLIPEncoderLayer

A single transformer encoder layer with pre-norm architecture:

Apply layer_norm1, then self-attention, then residual connection.
Apply layer_norm2, then MLP, then residual connection.

def forward(self, hidden_states: Tensor) -> Tensor:
    residual = hidden_states
    hidden_states = self.layer_norm1(hidden_states)
    hidden_states = self.self_attn(hidden_states=hidden_states)
    hidden_states = residual + hidden_states
    residual = hidden_states
    hidden_states = self.layer_norm2(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states
    outputs = (hidden_states,)
    return outputs
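
The order of operations is what distinguishes pre-norm from post-norm. The sketch below uses scalar stand-in functions (hypothetical, chosen only so the arithmetic is easy to follow) to show the pre-norm ordering used by the forward method, with a post-norm variant included purely for contrast.

```python
def pre_norm_layer(x, norm1, attn, norm2, mlp):
    """Pre-norm: normalize, transform, then add the un-normalized input back."""
    x = x + attn(norm1(x))
    x = x + mlp(norm2(x))
    return (x,)                      # 1-tuple, mirroring the forward method


def post_norm_layer(x, norm1, attn, norm2, mlp):
    """Post-norm, for contrast only: normalize after each residual addition."""
    x = norm1(x + attn(x))
    x = norm2(x + mlp(x))
    return (x,)


# Scalar stand-ins (assumptions) so the ordering is visible in the result.
norm = lambda t: t / 10.0
attn = lambda t: t + 1.0
mlp = lambda t: t * 2.0

(pre_out,) = pre_norm_layer(10.0, norm, attn, norm, mlp)
(post_out,) = post_norm_layer(10.0, norm, attn, norm, mlp)
print(pre_out, post_out)
```

In the pre-norm variant the residual path carries the raw input through unchanged, so the two orderings give different results even with identical sub-modules.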

Class: CLIPEncoder

Stacks multiple CLIPEncoderLayer modules and returns all intermediate hidden states (including the initial input embeddings):

def forward(self, inputs_embeds: Tensor) -> Tensor:
    hidden_states = inputs_embeds
    encoder_states: Tuple[Any, ...] = ()
    for _, encoder_layer in enumerate(self.layers):
        encoder_states = encoder_states + (hidden_states,)
        layer_outputs = encoder_layer(hidden_states)
        hidden_states = layer_outputs[0]
    encoder_states = encoder_states + (hidden_states,)
    return encoder_states

Although the signature is annotated -> Tensor, the method returns a tuple of length num_hidden_layers + 1: the input embeddings followed by the output of each layer. This allows downstream modules to select features from any layer.
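
The collection logic can be reproduced with stand-in layers to see which entry lands at which index. In this sketch each hypothetical layer just adds 1 to an integer, so the value of a state equals the number of layers applied to it; run_encoder and the layer count of 4 are assumptions for illustration.

```python
def run_encoder(inputs_embeds, layers):
    hidden = inputs_embeds
    states = ()
    for layer in layers:
        states = states + (hidden,)   # record the input to this layer
        hidden = layer(hidden)[0]     # each layer returns a 1-tuple
    return states + (hidden,)         # append the final layer's output


num_hidden_layers = 4                 # assumed value for the example
# Hypothetical stand-in layers: each one adds 1, so depth is readable.
layers = [lambda h: (h + 1,) for _ in range(num_hidden_layers)]

states = run_encoder(0, layers)
print(len(states))
print(states)
print(states[-2])
```

Index 0 holds the input embeddings and index -1 the output of the last layer, so index -2 is the output of the second-to-last layer.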

Class: CLIPVisionTransformer

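Combines the embeddings, a layer norm applied before the encoder, the encoder itself, and a layer norm applied after it into the full vision transformer pipeline. The attribute name pre_layrnorm is spelled this way in the code (the same spelling Hugging Face's CLIP implementation uses), so it is not a typo on this page: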

class CLIPVisionTransformer(Module):
    def __init__(self, config: CLIPVisionConfig):
        embed_dim = config.hidden_size  # embedding width shared by both layer norms
        self.embeddings = CLIPVisionEmbeddings(config)
        self.pre_layrnorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
        self.encoder = CLIPEncoder(config)
        self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)

Class: CLIPVisionModel

The top-level model class that wraps CLIPVisionTransformer. It returns the second-to-last hidden state from the encoder (index [-2]), which is a common practice for extracting visual features in CLIP-based vision-language models.

class CLIPVisionModel(Module):
    no_quantization: bool = True

    def forward(self, pixel_values: Tensor) -> Tensor:
        return self.vision_model(pixel_values)[-2]

The no_quantization = True class attribute signals that this module should not be quantized, preserving full-precision visual features.

Module Hierarchy

CLIPVisionModel
  CLIPVisionTransformer
    CLIPVisionEmbeddings
      Conv2D (patch_embedding)
      Parameter (class_embedding)
      Embedding (position_embedding)
    LayerNorm (pre_layrnorm)
    CLIPEncoder
      CLIPEncoderLayer (x num_hidden_layers)
        LayerNorm (layer_norm1)
        CLIPAttention
          Linear (q_proj, k_proj, v_proj, out_proj)
        LayerNorm (layer_norm2)
        CLIPMLP
          Linear (fc1, fc2)
          QuickGELU
    LayerNorm (post_layernorm)
