Implementation: OpenAI CLIP encode_image for Linear Probe Feature Extraction
| Knowledge Sources | |
|---|---|
| Domains | Vision, Transfer_Learning, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Concrete tool for extracting frozen CLIP image features over an entire dataset for linear probe evaluation, using CLIP.encode_image() in a no-gradient loop.
Description
This implementation uses CLIP.encode_image() (clip/model.py:L340-341) in a specific pattern: iterating over a DataLoader under torch.no_grad(), L2-normalizing the resulting feature vectors, and accumulating them as numpy arrays for use with scikit-learn. This pattern is demonstrated in the CLIP README (lines 141-191) for CIFAR-100 evaluation.
Unlike the general image encoding use case, this implementation:
- Processes an entire dataset batch-by-batch, not individual images
- Produces numpy arrays (not torch tensors) for sklearn compatibility
- Always includes L2 normalization
- Extracts features for both train and test splits separately
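The normalize-and-accumulate step above can be sketched with plain NumPy. The random arrays here are toy stand-ins for per-batch CLIP features (real ViT-B/32 batches would have shape [batch_size, 512]):

```python
import numpy as np

# Toy stand-ins for per-batch CLIP features: three batches of shape [B, D]
# (B=4, D=8 here for illustration)
rng = np.random.default_rng(0)
batches = [rng.standard_normal((4, 8)).astype(np.float32) for _ in range(3)]

# L2-normalize each batch, then accumulate into one array for sklearn
normed = [b / np.linalg.norm(b, axis=-1, keepdims=True) for b in batches]
features = np.concatenate(normed)
# features has shape (12, 8) and every row has unit L2 norm
```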
Usage
Use this implementation when evaluating CLIP's visual representations via linear probing on a classification benchmark like CIFAR-100.
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/model.py (encode_image at L340-341); usage pattern in README.md (L141-191)
Signature
```python
# The underlying API is CLIP.encode_image():
def encode_image(self, image: torch.Tensor) -> torch.Tensor:
    return self.visual(image.type(self.dtype))

# Used in the linear probe pattern:
#   model.encode_image(images) -> L2 normalize -> .cpu().numpy()
```
Import
```python
import clip
import torch
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataloader | torch.utils.data.DataLoader | Yes | DataLoader wrapping a dataset with CLIP's preprocess transform applied, yielding (images, labels) batches |
| model | CLIP | Yes | Loaded CLIP model in eval mode |
| device | str or torch.device | Yes | Device where the model resides |
Outputs
| Name | Type | Description |
|---|---|---|
| features | numpy.ndarray | L2-normalized image feature vectors, shape [N, embed_dim], dtype float32 |
| labels | numpy.ndarray | Corresponding class labels, shape [N] |
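The output contract can be exercised with a small NumPy sketch. The arrays are synthetic, with an embed_dim of 512 assumed to match ViT-B/32:

```python
import numpy as np

# Toy arrays shaped like the contract's outputs (N=6, embed_dim=512)
rng = np.random.default_rng(0)
features = rng.standard_normal((6, 512)).astype(np.float32)
features /= np.linalg.norm(features, axis=-1, keepdims=True)
labels = rng.integers(0, 100, size=6)

# Contract checks: 2-D float32 features, unit-norm rows, one label per row
assert features.dtype == np.float32
assert features.ndim == 2
assert np.allclose(np.linalg.norm(features, axis=-1), 1.0, atol=1e-5)
assert len(features) == len(labels)
```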
Usage Examples
Full Linear Probe Feature Extraction
```python
import os

import clip
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

# 1. Load model (clip.load returns the model in eval mode)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 2. Create dataloaders with CLIP preprocessing
root = os.path.expanduser("~/.cache")
train_dataset = CIFAR100(root, download=True, train=True, transform=preprocess)
test_dataset = CIFAR100(root, download=True, train=False, transform=preprocess)
train_loader = DataLoader(train_dataset, batch_size=100, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=100, num_workers=2)

# 3. Extract features
def get_features(dataloader):
    all_features = []
    all_labels = []
    with torch.no_grad():
        for images, labels in dataloader:
            features = model.encode_image(images.to(device))
            # L2-normalize each feature vector
            features /= features.norm(dim=-1, keepdim=True)
            # Cast to float32: encode_image returns float16 on CUDA
            all_features.append(features.float().cpu().numpy())
            all_labels.append(labels.numpy())
    return np.concatenate(all_features), np.concatenate(all_labels)

train_features, train_labels = get_features(train_loader)
test_features, test_labels = get_features(test_loader)

# train_features.shape: (50000, 512)
# test_features.shape: (10000, 512)
```
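The extracted arrays plug directly into scikit-learn. A sketch of the probe itself follows, using the LogisticRegression hyperparameters from the CLIP README's CIFAR-100 example; random arrays stand in for real CLIP features here, so the resulting accuracy is meaningless:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins for the arrays produced by get_features(); real CLIP
# features are required for a meaningful accuracy number
rng = np.random.default_rng(0)
train_features = rng.standard_normal((200, 512)).astype(np.float32)
train_labels = rng.integers(0, 10, size=200)
test_features = rng.standard_normal((50, 512)).astype(np.float32)
test_labels = rng.integers(0, 10, size=50)

# Hyperparameters from the CLIP README's CIFAR-100 example; the paper
# tunes C per dataset with a hyperparameter sweep
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000)
classifier.fit(train_features, train_labels)

predictions = classifier.predict(test_features)
accuracy = float(np.mean(predictions == test_labels)) * 100.0
```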