
Implementation:Openai CLIP CLIP Encode Image For Linear Probe

From Leeroopedia
Knowledge Sources
Domains Vision, Transfer_Learning, Evaluation
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool for extracting frozen CLIP image features over an entire dataset for linear probe evaluation, using CLIP.encode_image() in a no-gradient loop.

Description

This implementation uses CLIP.encode_image() (clip/model.py:L340-341) in a specific pattern: iterating over a DataLoader under torch.no_grad(), L2-normalizing the resulting feature vectors, and accumulating them as numpy arrays for use with scikit-learn. This pattern is demonstrated in the CLIP README (lines 141-191) for CIFAR-100 evaluation.

Unlike the general image encoding use case, this implementation:

  • Processes an entire dataset batch-by-batch, not individual images
  • Produces numpy arrays (not torch tensors) for sklearn compatibility
  • Always includes L2 normalization
  • Extracts features for both train and test splits separately

Usage

Use this implementation when evaluating CLIP's visual representations via linear probing on a classification benchmark like CIFAR-100.

Code Reference

Source Location

  • Repository: OpenAI CLIP
  • File: clip/model.py (encode_image at L340-341); usage pattern in README.md (L141-191)

Signature

# The underlying API is CLIP.encode_image():
def encode_image(self, image: torch.Tensor) -> torch.Tensor:
    return self.visual(image.type(self.dtype))

# Used in the linear probe pattern:
# model.encode_image(images) -> normalize -> .cpu().numpy()

Import

import clip
import torch
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
# clip.load() returns the model already in eval mode
model, preprocess = clip.load("ViT-B/32", device=device)

I/O Contract

Inputs

Name Type Required Description
dataloader torch.utils.data.DataLoader Yes DataLoader wrapping a dataset with CLIP's preprocess transform applied, yielding (images, labels) batches
model CLIP Yes Loaded CLIP model in eval mode
device str or torch.device Yes Device where the model resides

Outputs

Name Type Description
features numpy.ndarray L2-normalized image feature vectors, shape [N, embed_dim], dtype float32
labels numpy.ndarray Corresponding class labels, shape [N]
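A lightweight way to check this contract is to assert dtype, shape, and unit row norms on the returned arrays. The sketch below uses random synthetic features in place of real CLIP outputs; the normalization step mirrors the extraction loop:

```python
import numpy as np

# Synthetic stand-in for CLIP features: [N, embed_dim] float32.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 512)).astype(np.float32)

# The same L2 normalization applied in the extraction loop.
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Contract checks: dtype, shape, and unit-length rows.
assert features.dtype == np.float32
assert features.shape == (8, 512)
assert np.allclose(np.linalg.norm(features, axis=1), 1.0, atol=1e-5)
```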

Usage Examples

Full Linear Probe Feature Extraction

import os
import clip
import torch
import numpy as np
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader

# 1. Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 2. Create dataloaders with CLIP preprocessing
root = os.path.expanduser("~/.cache")
train_dataset = CIFAR100(root, download=True, train=True, transform=preprocess)
test_dataset = CIFAR100(root, download=True, train=False, transform=preprocess)

train_loader = DataLoader(train_dataset, batch_size=100, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=100, num_workers=2)

# 3. Extract features
def get_features(dataloader):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in dataloader:
            features = model.encode_image(images.to(device))
            # L2-normalize each feature vector
            features /= features.norm(dim=-1, keepdim=True)
            # Cast to float32: CLIP runs in float16 on CUDA,
            # and the output contract promises float32 arrays
            all_features.append(features.float().cpu().numpy())
            all_labels.append(labels.numpy())

    return np.concatenate(all_features), np.concatenate(all_labels)

train_features, train_labels = get_features(train_loader)
test_features, test_labels = get_features(test_loader)
# train_features.shape: [50000, 512]
# test_features.shape:  [10000, 512]
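The extracted arrays plug directly into scikit-learn. The CLIP README completes the probe by fitting a LogisticRegression classifier with C=0.316 (a value found there via hyperparameter sweep on a validation split). A self-contained sketch, with deterministic synthetic stand-ins for the extracted features and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the extracted CLIP features:
# [N, embed_dim] float32 unit vectors and 10 class labels.
rng = np.random.default_rng(0)
train_features = rng.standard_normal((500, 512)).astype(np.float32)
train_features /= np.linalg.norm(train_features, axis=1, keepdims=True)
train_labels = np.arange(500) % 10

# Logistic regression probe; C=0.316 follows the CLIP README.
# For a new dataset, sweep C over a log-spaced range instead.
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000)
classifier.fit(train_features, train_labels)

# In practice, predict on the held-out test_features/test_labels.
predictions = classifier.predict(train_features)
accuracy = np.mean(predictions == train_labels) * 100.0
```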

