Implementation: OpenAI CLIP Transform
| Knowledge Sources | |
|---|---|
| Domains | Vision, Preprocessing |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Documents _transform(), the internal helper that builds the CLIP image preprocessing pipeline shipped with the OpenAI CLIP library.
Description
The internal function _transform() constructs a torchvision.transforms.Compose pipeline that converts a raw PIL image of any size into a normalized tensor of shape [3, n_px, n_px]. The pipeline resizes with bicubic interpolation and applies CLIP-specific normalization constants computed from the WebImageText training dataset.
Users do not call _transform() directly. Instead, it is called internally by clip.load(), which returns the resulting Compose object as the second element of its return tuple. The preprocessing pipeline is matched to the specific model variant's input resolution.
Usage
Use the preprocess object returned by clip.load() to transform PIL images before passing them to the model. Apply it to individual images and batch the resulting tensors with torch.stack(), or pass it as a dataset transform.
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/clip.py
- Lines: L75-86
Signature
def _transform(n_px: int) -> torchvision.transforms.Compose:
"""Build the CLIP image preprocessing pipeline.
Parameters
----------
n_px : int
Target resolution in pixels. Derived from model.visual.input_resolution.
Returns
-------
torchvision.transforms.Compose
A composition of: Resize(n_px, bicubic) -> CenterCrop(n_px) ->
_convert_image_to_rgb -> ToTensor() ->
Normalize(mean, std)
"""
return Compose([
Resize(n_px, interpolation=BICUBIC),
CenterCrop(n_px),
_convert_image_to_rgb,
ToTensor(),
Normalize(
(0.48145466, 0.4578275, 0.40821073),
(0.26862954, 0.26130258, 0.27577711)
),
])
Import
# _transform is internal; users access the result via clip.load()
import clip
model, preprocess = clip.load("ViT-B/32")
# preprocess is the Compose object built by _transform()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image | PIL.Image.Image | Yes | Raw input image of any size and mode (RGB, RGBA, grayscale, palette) |
Outputs
| Name | Type | Description |
|---|---|---|
| tensor | torch.Tensor | Preprocessed image tensor of shape [3, n_px, n_px], dtype float32, with CLIP-specific normalization applied |
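Per channel, the contract amounts to mapping each 8-bit pixel through (x / 255 - mean) / std. A minimal pure-Python check of that arithmetic, using the constants from the source listing above (channel 0 = red):

```python
# Per-channel normalization applied by the pipeline: scale the 8-bit
# pixel to [0, 1] (ToTensor), then subtract the CLIP mean and divide
# by the CLIP std (Normalize). Constants copied from the source above.
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(value, channel):
    """Map an 8-bit pixel value to its normalized float for one channel."""
    return (value / 255.0 - MEAN[channel]) / STD[channel]

# A pure-white red channel lands near +1.93; pure black near -1.79.
print(round(normalize_pixel(255, 0), 4))  # 1.9303
print(round(normalize_pixel(0, 0), 4))    # -1.7923
```

This is why the output tensor's values are roughly in the [-1.8, 2.2] range rather than [0, 1].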
Usage Examples
Preprocessing a Single Image
import clip
import torch
from PIL import Image
# Load model and get the preprocess transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Open and preprocess an image
image = Image.open("photo.jpg")
image_tensor = preprocess(image).unsqueeze(0).to(device)
# image_tensor shape: [1, 3, 224, 224]
Preprocessing a Batch of Images
import clip
import torch
from PIL import Image
model, preprocess = clip.load("ViT-B/32", device="cuda")
image_paths = ["cat.jpg", "dog.jpg", "bird.jpg"]
images = [preprocess(Image.open(p)) for p in image_paths]
image_batch = torch.stack(images).to("cuda")
# image_batch shape: [3, 3, 224, 224]
Using as a Dataset Transform
import clip
import os
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader
model, preprocess = clip.load("ViT-B/32", device="cuda")
# Pass preprocess directly as the transform argument;
# expand "~" explicitly, since torchvision does not.
dataset = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False, transform=preprocess)
loader = DataLoader(dataset, batch_size=32, num_workers=2)