
Heuristic: OpenAI CLIP Class Name Curation

From Leeroopedia
Domains NLP, Computer_Vision, Optimization
Last Updated 2026-02-13 22:00 GMT

Overview

Curating class label names to resolve ambiguity improves CLIP zero-shot top-1 accuracy by ~1.5% (ViT-B/32 on ImageNet). Because CLIP interprets labels as natural language, polysemous words must be disambiguated.

Description

CLIP interprets class labels as natural language text, which means ambiguous or polysemous labels can confuse the model. For example, the ImageNet class "nail" is interpreted by CLIP as "fingernail" rather than "metal nail", and "kite" is interpreted as the flying toy rather than the bird of prey. The CLIP team manually curated the 1,000 ImageNet class names via trial-and-error testing on the training set, improving top-1 accuracy by ~1.5% on ViT-B/32 compared to the default class names. The authors estimate an additional 0.5-1% could be gained from further curation work.

Usage

Apply this heuristic when using CLIP for zero-shot classification on any dataset. Review class names for ambiguity and modify them to match how CLIP interprets natural language. This is especially important for datasets where class names are short, abbreviated, or polysemous.

The Insight (Rule of Thumb)

  • Action: Review all class labels for polysemy and ambiguity. Add disambiguating context to confusing labels.
  • Value: +1.5% top-1 accuracy on ViT-B/32 (potentially +0.5-1% more with further effort).
  • Trade-off: Manual effort required; trial-and-error process guided by per-class accuracy analysis.
  • Examples of fixes:
    • `nail` → `metal nail` (CLIP defaults to "fingernail")
    • `kite` → `kite (bird of prey)` (CLIP defaults to flying toy)
    • `red wolf` → `red wolf or maned wolf` (dataset contains mislabeled maned wolves)
    • `crane` → `crane bird` or `construction crane` (Section 3.1.4 of paper)
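The fixes above amount to a small label-curation map applied before prompts are built. A minimal sketch, assuming the standard "a photo of a {label}." template (the dictionary entries mirror the examples listed here; the helper names are illustrative, not from the CLIP repo):

```python
# Curated replacements for ambiguous ImageNet class names
# (entries mirror the examples above).
CURATED_LABELS = {
    "nail": "metal nail",                  # CLIP defaults to "fingernail"
    "kite": "kite (bird of prey)",         # ImageNet class is the bird, not the toy
    "red wolf": "red wolf or maned wolf",  # class contains mislabeled maned wolves
    "crane": "crane bird",                 # or "construction crane", per Section 3.1.4
}

def curate(labels):
    """Replace ambiguous class names with their disambiguated forms."""
    return [CURATED_LABELS.get(label, label) for label in labels]

def build_prompts(labels, template="a photo of a {}."):
    """Turn (curated) class names into natural-language prompts for CLIP's text encoder."""
    return [template.format(label) for label in curate(labels)]

prompts = build_prompts(["goldfish", "nail", "kite"])
# -> ["a photo of a goldfish.", "a photo of a metal nail.",
#     "a photo of a kite (bird of prey)."]
```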

Reasoning

CLIP was trained on 400M internet image-text pairs where the natural language meaning dominates. When a class name has multiple meanings, CLIP's text encoder produces an embedding that reflects the most common internet usage, which may not match the dataset's intended meaning. Disambiguating the class name steers the text embedding toward the correct visual concept. This is a form of prompt engineering at the label level, complementary to template engineering at the sentence level.
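The two levels compose naturally: label curation fixes *what* concept is named, while template ensembling varies *how* it is phrased (CLIP averages the text embeddings across templates to form each class's weight vector). A sketch of generating the per-class prompt set, using a small illustrative subset of templates (the CLIP repo ships 80):

```python
# Label-level curation composed with sentence-level template ensembling.
# Templates here are an illustrative subset, not the full CLIP list.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sculpture of a {}.",
]

def prompts_for_class(label, curated=None):
    """Return all template variants for one (optionally curated) class name.

    In CLIP's zero-shot classifier, each variant is encoded and the text
    embeddings are averaged into a single per-class weight vector.
    """
    name = curated.get(label, label) if curated else label
    return [t.format(name) for t in TEMPLATES]

curated = {"kite": "kite (bird of prey)"}
variants = prompts_for_class("kite", curated)
# 3 prompts, all naming the disambiguated concept
```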

Code Evidence

Commentary from notebook cell 9 (markdown):

"These edits were made via trial and error and concentrated on the lowest performing classes according to top_1 and top_5 accuracy on the ImageNet training set for the RN50, RN101, and RN50x4 models. These tweaks improve top_1 by 1.5% on ViT-B/32 over using the default class names."
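The trial-and-error loop described in this quote can be sketched as a triage step: rank classes by per-class top-1 accuracy and surface the worst performers as curation candidates. The accuracy numbers below are made up for illustration:

```python
def curation_candidates(per_class_top1, k=3):
    """Return the k lowest-accuracy classes: the ones worth inspecting
    for ambiguous or polysemous names."""
    return sorted(per_class_top1, key=per_class_top1.get)[:k]

# Hypothetical per-class top-1 accuracies on the training set.
per_class_top1 = {
    "goldfish": 0.92,
    "nail": 0.11,   # low: CLIP reads "nail" as "fingernail"
    "kite": 0.23,   # low: CLIP reads "kite" as the toy
    "tench": 0.85,
}
curation_candidates(per_class_top1, k=2)  # -> ["nail", "kite"]
```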

Specific examples from notebook cell 9:

- CLIP interprets "nail" as "fingernail", so the label was changed to "metal nail".
- The ImageNet "kite" class refers to the bird of prey, not the flying toy, so "kite" was changed to "kite (bird of prey)".
- The ImageNet class for red wolf appears to include many mislabeled maned wolves, so "red wolf" was changed to "red wolf or maned wolf".

Curated class name examples from notebook cell 8:

imagenet_classes = ["tench", "goldfish", "great white shark", ..., "metal nail", ..., "kite (bird of prey)", ..., "red wolf or maned wolf", ...]
