Principle: OpenAI CLIP Linear Probe Feature Extraction
| Knowledge Sources | |
|---|---|
| Domains | Vision, Transfer_Learning, Evaluation |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A feature extraction strategy that uses a frozen pretrained vision encoder to produce fixed image representations for training a lightweight linear classifier.
Description
Linear Probe Feature Extraction is a standard evaluation protocol for measuring the quality of learned visual representations. Rather than fine-tuning the entire model on a target task, the pretrained vision encoder is frozen (no gradient computation) and used purely as a feature extractor. The extracted features are then used to train a simple linear classifier (e.g., logistic regression).
This approach evaluates how much useful information the pretrained model has already captured, without the confounding effect of task-specific optimization. The process consists of:
- Freeze the encoder: Use torch.no_grad() to prevent gradient computation, treating the vision model as a deterministic feature function.
- Extract features batch-wise: Iterate over the entire dataset using a DataLoader, encoding each batch of images into feature vectors.
- L2 normalize: Normalize the extracted features to unit length, which is standard practice for CLIP embeddings and improves linear classifier performance.
- Accumulate features: Collect all feature vectors and labels across batches into arrays suitable for a linear classifier (typically numpy arrays for scikit-learn).
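The L2-normalization step above can be checked in isolation: dividing each feature vector by its own norm yields unit-length vectors (NumPy is used here for brevity; the extraction loop in the Theoretical Basis section does the same with torch tensors):

```python
import numpy as np

# Two toy "feature vectors" standing in for encoder outputs
features = np.array([[3.0, 4.0],
                     [0.0, 2.0]])

# Divide each row by its L2 norm, mirroring
# features / features.norm(dim=-1, keepdim=True) in torch
normalized = features / np.linalg.norm(features, axis=-1, keepdims=True)

print(np.linalg.norm(normalized, axis=-1))  # → [1. 1.]
```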
The key distinction from the general Image Feature Encoding principle is scope: here, features are extracted over an entire dataset inside a no-gradient evaluation loop and converted to numpy arrays for use with scikit-learn, rather than remaining torch tensors for further neural-network processing.
Usage
Use this principle when evaluating the quality of a pretrained visual representation on a classification benchmark. Standard benchmarks include CIFAR-10, CIFAR-100, ImageNet, and others. This is a common reporting metric in vision-language model papers.
Theoretical Basis
The linear probe protocol evaluates the linear separability of learned representations:
# Feature extraction loop (assumes a CLIP-style `model` with an
# `encode_image` method, a `dataloader`, and a `device` are already defined)
import numpy as np
import torch

all_features = []
all_labels = []
model.eval()
with torch.no_grad():
    for images, labels in dataloader:
        features = model.encode_image(images.to(device))
        features = features / features.norm(dim=-1, keepdim=True)  # L2 normalize
        all_features.append(features.cpu().numpy())
        all_labels.append(labels.numpy())

train_features = np.concatenate(all_features)  # [N, embed_dim]
train_labels = np.concatenate(all_labels)      # [N]
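Once features and labels are accumulated, the linear classifier step can be sketched with scikit-learn's logistic regression. Synthetic arrays stand in for real extracted features here, and the regularization strength `C=1.0` is an illustrative choice (in practice it is often swept on a validation set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for extracted CLIP features and labels
rng = np.random.default_rng(0)
train_features = rng.normal(size=(200, 16)).astype(np.float32)
train_labels = (train_features[:, 0] > 0).astype(np.int64)  # linearly separable by construction

# Fit a linear probe; C controls inverse regularization strength
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(train_features, train_labels)

# Linear probe accuracy (here measured on the training set for brevity;
# a real evaluation uses a held-out test split)
accuracy = clf.score(train_features, train_labels)
```

The same `fit`/`score` pattern applies unchanged to real CLIP features: higher accuracy indicates that the frozen encoder's embedding space separates the classes more linearly.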
The assumption is that a good representation maps semantically similar images to nearby points in the embedding space, making them linearly separable. A higher linear probe accuracy indicates more useful and structured representations.