Principle: OpenAI CLIP Image Feature Encoding
| Knowledge Sources | |
|---|---|
| Domains | Vision, Deep_Learning, Representation_Learning |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A vision encoding mechanism that maps preprocessed image tensors into a shared embedding space where they can be compared directly with text embeddings via cosine similarity.
Description
Image Feature Encoding is the process of transforming a preprocessed image tensor into a fixed-dimensional feature vector that captures the semantic content of the image. In CLIP, this is performed by a vision encoder that maps images into a shared embedding space with text. The model supports two vision encoder architectures:
- Vision Transformer (ViT): Splits the image into fixed-size patches, projects each patch to an embedding, prepends a learnable [CLS] token, adds positional embeddings, processes through transformer layers, and projects the [CLS] token output through a learned linear projection.
- Modified ResNet: A ResNet variant with a 3-conv stem (instead of 1), anti-aliased strided convolutions (avgpool before strided conv), and a QKV attention pooling layer (instead of global average pooling) that produces the final embedding.
Both architectures produce an output vector of dimension embed_dim (e.g., 512 for ViT-B/32, 768 for ViT-L/14) that lies in the same space as text embeddings.
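The shape bookkeeping for ViT-B/32 can be checked with a minimal sketch (the specific values — 224-pixel input, 32-pixel patches, width 768, embed_dim 512 — are the standard ViT-B/32 configuration; treat the sketch as illustrative, not as the model's implementation):

```python
import numpy as np

image_size, patch_size, width, embed_dim = 224, 32, 768, 512

grid = image_size // patch_size          # 7 patches per side
num_tokens = grid * grid + 1             # 49 patches + 1 [CLS] token

# The final learned projection maps transformer width -> shared embed_dim.
projection = np.zeros((width, embed_dim))
cls_output = np.zeros((1, width))        # stand-in for the [CLS] output
embedding = cls_output @ projection
print(embedding.shape)                   # (1, 512)
```

The same arithmetic applies to other variants; only the patch size, width, and embed_dim change.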
Usage
Use this principle whenever you need to extract visual features from images for comparison with text features, visual similarity search, or downstream classification. The image encoder is used in zero-shot classification, linear probing, and any task requiring image embeddings.
Theoretical Basis
The image encoder maps an image to a point in the joint embedding space. The two supported architectures differ in their approach:
Vision Transformer:
```python
# Pseudo-code for ViT encoding
patches = conv2d(image, patch_size)        # [B, width, grid, grid]
patches = reshape_and_permute(patches)     # [B, grid^2, width]
tokens = concat([cls_token, patches])      # [B, grid^2+1, width]
tokens = tokens + positional_embedding     # Add position info
tokens = layer_norm(tokens)
tokens = transformer(tokens)               # L layers of self-attention
cls_output = layer_norm(tokens[:, 0, :])   # Take [CLS] token
embedding = cls_output @ projection_matrix # Project to embed_dim
```
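The ViT steps above can be made concrete with a toy NumPy sketch. All weights are random, a single self-attention layer stands in for the full transformer stack, and the conv2d patch embedding is replaced by the equivalent reshape-plus-linear map; shapes follow ViT-B/32:

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 2, 3, 224, 224
patch, width, embed_dim = 32, 768, 512
grid = H // patch                                  # 7

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

image = rng.standard_normal((B, C, H, W))

# conv2d with kernel=stride=patch is equivalent to flattening each patch
# and applying a shared linear map (toy random weights here).
patches = image.reshape(B, C, grid, patch, grid, patch)
patches = patches.transpose(0, 2, 4, 1, 3, 5).reshape(B, grid * grid, -1)
W_patch = rng.standard_normal((C * patch * patch, width)) * 0.02
tokens = patches @ W_patch                         # [B, 49, width]

cls_token = rng.standard_normal((1, 1, width)) * 0.02
tokens = np.concatenate([np.broadcast_to(cls_token, (B, 1, width)), tokens], axis=1)
tokens = tokens + rng.standard_normal((grid * grid + 1, width)) * 0.02  # pos emb
tokens = layer_norm(tokens)

# One toy self-attention layer stands in for the L transformer blocks.
Wq, Wk, Wv = (rng.standard_normal((width, width)) * 0.02 for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(width))
tokens = tokens + attn @ v

cls_out = layer_norm(tokens[:, 0, :])              # [B, width]
W_proj = rng.standard_normal((width, embed_dim)) * 0.02
embedding = cls_out @ W_proj                       # [B, embed_dim]
print(embedding.shape)                             # (2, 512)
```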
Modified ResNet:
```python
# Pseudo-code for ResNet encoding
x = three_conv_stem(image)       # 3-layer stem instead of 1
x = residual_layers(x)           # 4 residual layer groups
embedding = attention_pool(x)    # QKV attention pooling (not avg pool)
```
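The QKV attention pooling in the last step can be sketched in NumPy. The query is the mean-pooled spatial feature attending over all spatial positions; positional embeddings, the multi-head split, and learned biases are omitted, and the RN50-like dimensions (7x7 feature map, 2048 channels) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
B, C, HW, embed_dim = 2, 2048, 49, 512   # e.g. a 7x7 map with 2048 channels

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

feats = rng.standard_normal((B, HW, C))  # flattened final feature map

# Query from the mean-pooled feature; keys/values from spatial features.
Wq, Wk = (rng.standard_normal((C, C)) * 0.01 for _ in range(2))
Wv = rng.standard_normal((C, embed_dim)) * 0.01

q = feats.mean(axis=1, keepdims=True) @ Wq   # [B, 1, C]
k = feats @ Wk                               # [B, HW, C]
v = feats @ Wv                               # [B, HW, embed_dim]

attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))
embedding = (attn @ v)[:, 0, :]              # [B, embed_dim]
print(embedding.shape)                       # (2, 512)
```

Note how the value projection doubles as the output projection to embed_dim, so no separate projection matrix is needed in this sketch.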
The output embeddings are not L2-normalized by `encode_image()` itself. Normalization is applied downstream when computing similarities (in the `forward()` method or manually).
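The downstream normalization and comparison step looks like this (toy random vectors stand in for the raw encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
image_emb = rng.standard_normal((2, 512))   # stand-in for raw encode_image() output
text_emb = rng.standard_normal((3, 512))    # stand-in for raw encode_text() output

# L2-normalize so that a dot product equals cosine similarity.
image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)

# CLIP's forward() additionally multiplies by a learned logit_scale
# before the softmax; that scaling is omitted here.
similarity = image_emb @ text_emb.T         # [2, 3], values in [-1, 1]
```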