Implementation: OpenAI CLIP encode_image for Linear Probe Feature Extraction
| Knowledge Sources | |
|---|---|
| Domains | Vision, Transfer_Learning, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Concrete tool for extracting frozen CLIP image features over an entire dataset for linear probe evaluation, using CLIP.encode_image() in a no-gradient loop.
Description
This implementation uses CLIP.encode_image() (clip/model.py:L340-341) in a specific pattern: iterating over a DataLoader under torch.no_grad(), L2-normalizing the resulting feature vectors, and accumulating them as numpy arrays for use with scikit-learn. This pattern is demonstrated in the CLIP README (lines 141-191) for CIFAR-100 evaluation.
Unlike the general image encoding use case, this implementation:
- Processes an entire dataset batch-by-batch, not individual images
- Produces numpy arrays (not torch tensors) for sklearn compatibility
- Always includes L2 normalization
- Extracts features for both train and test splits separately
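The normalize-and-accumulate step above can be sketched with plain NumPy. The random arrays here are toy stand-ins for per-batch CLIP features (real ViT-B/32 batches would have shape [batch_size, 512]):

```python
import numpy as np

# Toy stand-ins for per-batch CLIP features: three batches of shape [B, D]
# (B=4, D=8 here for illustration)
rng = np.random.default_rng(0)
batches = [rng.standard_normal((4, 8)).astype(np.float32) for _ in range(3)]

# L2-normalize each batch, then accumulate into one array for sklearn
normed = [b / np.linalg.norm(b, axis=-1, keepdims=True) for b in batches]
features = np.concatenate(normed)
# features has shape (12, 8) and every row has unit L2 norm
```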
Usage
Use this implementation when evaluating CLIP's visual representations via linear probing on a classification benchmark like CIFAR-100.
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/model.py (encode_image at L340-341); usage pattern in README.md (L141-191)
Signature
```python
# The underlying API is CLIP.encode_image():
def encode_image(self, image: torch.Tensor) -> torch.Tensor:
    return self.visual(image.type(self.dtype))

# Used in the linear probe pattern:
#   model.encode_image(images) -> L2 normalize -> .cpu().numpy()
```
Import
```python
import clip
import torch
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataloader | torch.utils.data.DataLoader | Yes | DataLoader wrapping a dataset with CLIP's preprocess transform applied, yielding (images, labels) batches |
| model | CLIP | Yes | Loaded CLIP model in eval mode |
| device | str or torch.device | Yes | Device where the model resides |
Outputs
| Name | Type | Description |
|---|---|---|
| features | numpy.ndarray | L2-normalized image feature vectors, shape [N, embed_dim], dtype float32 |
| labels | numpy.ndarray | Corresponding class labels, shape [N] |
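The output contract can be exercised with a small NumPy sketch. The arrays are synthetic, with an embed_dim of 512 assumed to match ViT-B/32:

```python
import numpy as np

# Toy arrays shaped like the contract's outputs (N=6, embed_dim=512)
rng = np.random.default_rng(0)
features = rng.standard_normal((6, 512)).astype(np.float32)
features /= np.linalg.norm(features, axis=-1, keepdims=True)
labels = rng.integers(0, 100, size=6)

# Contract checks: 2-D float32 features, unit-norm rows, one label per row
assert features.dtype == np.float32
assert features.ndim == 2
assert np.allclose(np.linalg.norm(features, axis=-1), 1.0, atol=1e-5)
assert len(features) == len(labels)
```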
Usage Examples
Full Linear Probe Feature Extraction
```python
import os

import clip
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

# 1. Load model (clip.load returns the model in eval mode)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 2. Create dataloaders with CLIP preprocessing
root = os.path.expanduser("~/.cache")
train_dataset = CIFAR100(root, download=True, train=True, transform=preprocess)
test_dataset = CIFAR100(root, download=True, train=False, transform=preprocess)
train_loader = DataLoader(train_dataset, batch_size=100, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=100, num_workers=2)

# 3. Extract features
def get_features(dataloader):
    all_features = []
    all_labels = []
    with torch.no_grad():
        for images, labels in dataloader:
            features = model.encode_image(images.to(device))
            # L2-normalize each feature vector
            features /= features.norm(dim=-1, keepdim=True)
            # Cast to float32: encode_image returns float16 on CUDA
            all_features.append(features.float().cpu().numpy())
            all_labels.append(labels.numpy())
    return np.concatenate(all_features), np.concatenate(all_labels)

train_features, train_labels = get_features(train_loader)
test_features, test_labels = get_features(test_loader)

# train_features.shape: (50000, 512)
# test_features.shape: (10000, 512)
```
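The extracted arrays plug directly into scikit-learn. A sketch of the probe itself follows, using the LogisticRegression hyperparameters from the CLIP README's CIFAR-100 example; random arrays stand in for real CLIP features here, so the resulting accuracy is meaningless:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins for the arrays produced by get_features(); real CLIP
# features are required for a meaningful accuracy number
rng = np.random.default_rng(0)
train_features = rng.standard_normal((200, 512)).astype(np.float32)
train_labels = rng.integers(0, 10, size=200)
test_features = rng.standard_normal((50, 512)).astype(np.float32)
test_labels = rng.integers(0, 10, size=50)

# Hyperparameters from the CLIP README's CIFAR-100 example; the paper
# tunes C per dataset with a hyperparameter sweep
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000)
classifier.fit(train_features, train_labels)

predictions = classifier.predict(test_features)
accuracy = float(np.mean(predictions == test_labels)) * 100.0
```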