Principle: OpenAI CLIP Image Feature Encoding
| Knowledge Sources | |
|---|---|
| Domains | Vision, Deep_Learning, Representation_Learning |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A vision encoding mechanism that maps preprocessed image tensors into a shared embedding space where they can be compared directly with text embeddings via cosine similarity.
Description
Image Feature Encoding is the process of transforming a preprocessed image tensor into a fixed-dimensional feature vector that captures the semantic content of the image. In CLIP, this is performed by a vision encoder that maps images into a shared embedding space with text. The model supports two vision encoder architectures:
- Vision Transformer (ViT): Splits the image into fixed-size patches, projects each patch to an embedding, prepends a learnable [CLS] token, adds positional embeddings, processes through transformer layers, and projects the [CLS] token output through a learned linear projection.
- Modified ResNet: A ResNet variant with a 3-conv stem (instead of 1), anti-aliased strided convolutions (avgpool before strided conv), and a QKV attention pooling layer (instead of global average pooling) that produces the final embedding.
Both architectures produce an output vector of dimension embed_dim (e.g., 512 for ViT-B/32, 768 for ViT-L/14) that lies in the same space as text embeddings.
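The shape bookkeeping for ViT-B/32 can be checked with a minimal sketch (the specific values — 224-pixel input, 32-pixel patches, width 768, embed_dim 512 — are the standard ViT-B/32 configuration; treat the sketch as illustrative, not as the model's implementation):

```python
import numpy as np

image_size, patch_size, width, embed_dim = 224, 32, 768, 512

grid = image_size // patch_size          # 7 patches per side
num_tokens = grid * grid + 1             # 49 patches + 1 [CLS] token

# The final learned projection maps transformer width -> shared embed_dim.
projection = np.zeros((width, embed_dim))
cls_output = np.zeros((1, width))        # stand-in for the [CLS] output
embedding = cls_output @ projection
print(embedding.shape)                   # (1, 512)
```

The same arithmetic applies to other variants; only the patch size, width, and embed_dim change.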
Usage
Use this principle whenever you need to extract visual features from images for comparison with text features, visual similarity search, or downstream classification. The image encoder is used in zero-shot classification, linear probing, and any task requiring image embeddings.
Theoretical Basis
The image encoder maps an image to a point in the joint embedding space. The two supported architectures differ in their approach:
Vision Transformer:
```python
# Pseudo-code for ViT encoding
patches = conv2d(image, patch_size)        # [B, width, grid, grid]
patches = reshape_and_permute(patches)     # [B, grid^2, width]
tokens = concat([cls_token, patches])      # [B, grid^2+1, width]
tokens = tokens + positional_embedding     # Add position info
tokens = layer_norm(tokens)
tokens = transformer(tokens)               # L layers of self-attention
cls_output = layer_norm(tokens[:, 0, :])   # Take [CLS] token
embedding = cls_output @ projection_matrix # Project to embed_dim
```
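The ViT steps above can be made concrete with a toy NumPy sketch. All weights are random, a single self-attention layer stands in for the full transformer stack, and the conv2d patch embedding is replaced by the equivalent reshape-plus-linear map; shapes follow ViT-B/32:

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 2, 3, 224, 224
patch, width, embed_dim = 32, 768, 512
grid = H // patch                                  # 7

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

image = rng.standard_normal((B, C, H, W))

# conv2d with kernel=stride=patch is equivalent to flattening each patch
# and applying a shared linear map (toy random weights here).
patches = image.reshape(B, C, grid, patch, grid, patch)
patches = patches.transpose(0, 2, 4, 1, 3, 5).reshape(B, grid * grid, -1)
W_patch = rng.standard_normal((C * patch * patch, width)) * 0.02
tokens = patches @ W_patch                         # [B, 49, width]

cls_token = rng.standard_normal((1, 1, width)) * 0.02
tokens = np.concatenate([np.broadcast_to(cls_token, (B, 1, width)), tokens], axis=1)
tokens = tokens + rng.standard_normal((grid * grid + 1, width)) * 0.02  # pos emb
tokens = layer_norm(tokens)

# One toy self-attention layer stands in for the L transformer blocks.
Wq, Wk, Wv = (rng.standard_normal((width, width)) * 0.02 for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(width))
tokens = tokens + attn @ v

cls_out = layer_norm(tokens[:, 0, :])              # [B, width]
W_proj = rng.standard_normal((width, embed_dim)) * 0.02
embedding = cls_out @ W_proj                       # [B, embed_dim]
print(embedding.shape)                             # (2, 512)
```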
Modified ResNet:
```python
# Pseudo-code for ResNet encoding
x = three_conv_stem(image)       # 3-layer stem instead of 1
x = residual_layers(x)           # 4 residual layer groups
embedding = attention_pool(x)    # QKV attention pooling (not avg pool)
```
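The QKV attention pooling in the last step can be sketched in NumPy. The query is the mean-pooled spatial feature attending over all spatial positions; positional embeddings, the multi-head split, and learned biases are omitted, and the RN50-like dimensions (7x7 feature map, 2048 channels) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
B, C, HW, embed_dim = 2, 2048, 49, 512   # e.g. a 7x7 map with 2048 channels

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

feats = rng.standard_normal((B, HW, C))  # flattened final feature map

# Query from the mean-pooled feature; keys/values from spatial features.
Wq, Wk = (rng.standard_normal((C, C)) * 0.01 for _ in range(2))
Wv = rng.standard_normal((C, embed_dim)) * 0.01

q = feats.mean(axis=1, keepdims=True) @ Wq   # [B, 1, C]
k = feats @ Wk                               # [B, HW, C]
v = feats @ Wv                               # [B, HW, embed_dim]

attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))
embedding = (attn @ v)[:, 0, :]              # [B, embed_dim]
print(embedding.shape)                       # (2, 512)
```

Note how the value projection doubles as the output projection to embed_dim, so no separate projection matrix is needed in this sketch.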
The output embeddings are not L2-normalized by `encode_image()` itself. Normalization is applied downstream when computing similarities (in the `forward()` method or manually).
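The downstream normalization and comparison step looks like this (toy random vectors stand in for the raw encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
image_emb = rng.standard_normal((2, 512))   # stand-in for raw encode_image() output
text_emb = rng.standard_normal((3, 512))    # stand-in for raw encode_text() output

# L2-normalize so that a dot product equals cosine similarity.
image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)

# CLIP's forward() additionally multiplies by a learned logit_scale
# before the softmax; that scaling is omitted here.
similarity = image_emb @ text_emb.T         # [2, 3], values in [-1, 1]
```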