Heuristic: OpenAI CLIP Normalization Constants
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Debugging |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
CLIP uses its own dataset-specific normalization constants (mean and std) for image preprocessing, which differ from the standard ImageNet normalization values; using incorrect constants degrades accuracy.
Description
The standard ImageNet normalization uses mean `(0.485, 0.456, 0.406)` and std `(0.229, 0.224, 0.225)`. CLIP was trained on a different dataset (400M image-text pairs from the internet) and uses its own normalization constants: mean `(0.48145466, 0.4578275, 0.40821073)` and std `(0.26862954, 0.26130258, 0.27577711)`. These are baked into the `_transform()` function returned by `clip.load()`. Using ImageNet constants instead will shift the input distribution away from what CLIP expects, degrading feature quality and classification accuracy.
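To make the gap between the two constant sets concrete, here is a small plain-Python comparison (my own illustration, not code from the CLIP repository): the means are nearly identical, but the standard deviations diverge noticeably.

```python
# Compare CLIP's normalization constants against the ImageNet defaults.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

mean_delta = max(abs(c - i) for c, i in zip(CLIP_MEAN, IMAGENET_MEAN))
std_delta = max(abs(c - i) for c, i in zip(CLIP_STD, IMAGENET_STD))
print(f"max mean difference: {mean_delta:.4f}")  # under 0.01
print(f"max std difference:  {std_delta:.4f}")   # about 0.05 (blue channel)
```

The std mismatch is the larger problem: it rescales every feature fed to the first convolution, not just shifts it.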
Usage
Apply this heuristic when building custom preprocessing pipelines for CLIP or integrating CLIP with existing data loaders that use ImageNet normalization. Always use the CLIP-specific constants, or better yet, use the `preprocess` transform returned by `clip.load()` directly.
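If you do build a custom pipeline, the normalization step reduces to a simple per-channel affine transform. The sketch below shows that step in plain Python; the function name `clip_normalize` is illustrative (a real pipeline would use `torchvision.transforms.Normalize` with the same constants, as `clip.load()` does).

```python
# Illustrative sketch of CLIP's per-channel normalization step,
# written without torchvision for clarity. clip_normalize is a
# hypothetical helper name, not part of the CLIP API.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def clip_normalize(rgb):
    """Normalize an (r, g, b) pixel already scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

# A mid-gray pixel lands near zero, as the means are all close to 0.5:
print(clip_normalize((0.5, 0.5, 0.5)))
```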
The Insight (Rule of Thumb)
- Action: Use CLIP-specific normalization constants, not ImageNet defaults. Prefer using the `preprocess` transform from `clip.load()` directly.
- Value:
- CLIP mean: `(0.48145466, 0.4578275, 0.40821073)`
- CLIP std: `(0.26862954, 0.26130258, 0.27577711)`
- ImageNet mean (DO NOT USE): `(0.485, 0.456, 0.406)`
- ImageNet std (DO NOT USE): `(0.229, 0.224, 0.225)`
- Trade-off: None; always use the correct constants. Using wrong constants silently degrades results without any error message.
Reasoning
Neural networks are sensitive to input normalization because the learned weights expect a specific input distribution. CLIP's constants are empirically derived from its training data distribution, which differs from ImageNet. While the mean values are close (within 0.01), the standard deviations differ more significantly (e.g., 0.269 vs 0.229 for the red channel). The complete preprocessing pipeline also uses BICUBIC interpolation (not bilinear) for resizing, which further affects feature quality.
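A rough back-of-the-envelope calculation (my own illustration) shows how much a single channel drifts when normalized with the wrong constants: for a bright red value, the error is on the order of 0.2 in normalized units, applied consistently to every pixel.

```python
# Normalize the same red-channel value with both constant sets and
# measure the drift. The pixel value 0.8 is an arbitrary example.
CLIP_MEAN_R, CLIP_STD_R = 0.48145466, 0.26862954
IMAGENET_MEAN_R, IMAGENET_STD_R = 0.485, 0.229

pixel = 0.8
correct = (pixel - CLIP_MEAN_R) / CLIP_STD_R
wrong = (pixel - IMAGENET_MEAN_R) / IMAGENET_STD_R
print(f"correct: {correct:.3f}, wrong: {wrong:.3f}, drift: {wrong - correct:.3f}")
print(f"std ratio (red): {CLIP_STD_R / IMAGENET_STD_R:.3f}")  # ~1.17x scale mismatch
```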
Code Evidence
CLIP normalization in _transform from `clip/clip.py:79-86`:
```python
def _transform(n_px):
    return Compose([
        Resize(n_px, interpolation=BICUBIC),
        CenterCrop(n_px),
        _convert_image_to_rgb,
        ToTensor(),
        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ])
```
BICUBIC interpolation fallback from `clip/clip.py:16-20`:
```python
try:
    from torchvision.transforms import InterpolationMode
    BICUBIC = InterpolationMode.BICUBIC
except ImportError:
    BICUBIC = Image.BICUBIC
```