Principle: NVIDIA NeMo Curator CLIP Embedding
| Metadata | |
|---|---|
| Knowledge Sources | Paper: Learning Transferable Visual Models From Natural Language Supervision (CLIP) |
| Domains | Data_Curation, Image_Processing, Representation_Learning |
| Last Updated | 2026-02-14 |
Overview
CLIP Embedding is a technique for computing dense vector representations of images using the CLIP ViT-L/14 model, enabling similarity search and downstream filtering in image curation pipelines.
Description
CLIP Embedding in NeMo Curator uses OpenAI's CLIP (Contrastive Language-Image Pre-training) model to project images into a shared vision-language embedding space. Specifically, the ViT-L/14 variant of the CLIP image encoder maps raw pixel data into fixed-dimensional dense vectors. These embeddings capture the high-level semantic content of images and can be used for a variety of downstream tasks, including similarity search, aesthetic quality scoring, NSFW content detection, and semantic deduplication. The embedding computation is performed on GPU for efficiency, and the resulting vectors are stored alongside the image data in ImageBatch objects for consumption by subsequent pipeline stages.
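The flow described above can be sketched as follows. This is a minimal, self-contained illustration, not NeMo Curator's actual implementation: `ImageBatch` here is a hypothetical stand-in for the curator's container of the same name, and `encode_images` is a placeholder for the real ViT-L/14 GPU forward pass (a fixed random projection keeps the sketch runnable). The 768-dimensional output matches CLIP ViT-L/14's projected embedding size.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

EMBED_DIM = 768  # CLIP ViT-L/14 projects images to 768-dimensional vectors


@dataclass
class ImageBatch:
    """Minimal stand-in for NeMo Curator's ImageBatch container."""
    images: np.ndarray                       # preprocessed pixel data, (N, C, H, W)
    embeddings: Optional[np.ndarray] = None  # filled in by the embedding stage


def encode_images(pixels: np.ndarray) -> np.ndarray:
    """Placeholder for the CLIP ViT-L/14 image encoder (in the real pipeline
    this is a GPU forward pass); a fixed random projection keeps it runnable."""
    flat = pixels.reshape(len(pixels), -1)
    rng = np.random.default_rng(0)
    w = rng.standard_normal((flat.shape[1], EMBED_DIM))
    return flat @ w


def embedding_stage(batch: ImageBatch) -> ImageBatch:
    """Compute embeddings and L2-normalize them so cosine similarity
    between two images reduces to a dot product of their vectors."""
    raw = encode_images(batch.images)
    batch.embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)
    return batch


# Tiny 8x8 stand-in images (real CLIP inputs are 224x224 RGB crops).
batch = embedding_stage(ImageBatch(images=np.random.default_rng(1).random((4, 3, 8, 8))))
print(batch.embeddings.shape)  # (4, 768)
```

Normalizing at this stage is a common choice because every downstream consumer (similarity search, deduplication, lightweight classifiers) can then treat dot products as cosine similarities.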
Usage
Use CLIP Embedding after the Image Ingestion stage and before any filtering or deduplication stages that require embedding vectors as input. It is required whenever downstream stages such as aesthetic filtering, NSFW filtering, or semantic deduplication consume CLIP embeddings rather than raw pixel data.
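The ordering constraint can be made explicit with a small validation sketch. The stage names and the `Stage` class below are hypothetical illustrations of the dependency, not NeMo Curator API: the check simply enforces that every embedding-consuming stage appears after the embedding-producing one.

```python
class Stage:
    """Hypothetical pipeline stage with embedding produce/consume flags."""
    def __init__(self, name, requires_embeddings=False, produces_embeddings=False):
        self.name = name
        self.requires_embeddings = requires_embeddings
        self.produces_embeddings = produces_embeddings


# Illustrative ordering: ingest -> embed -> embedding-dependent stages.
pipeline = [
    Stage("image_ingestion"),
    Stage("clip_embedding", produces_embeddings=True),
    Stage("aesthetic_filter", requires_embeddings=True),
    Stage("nsfw_filter", requires_embeddings=True),
    Stage("semantic_dedup", requires_embeddings=True),
]


def validate(stages):
    """Raise if a stage that consumes embeddings runs before they are produced."""
    have_embeddings = False
    for s in stages:
        if s.requires_embeddings and not have_embeddings:
            raise ValueError(f"{s.name} placed before the embedding stage")
        have_embeddings = have_embeddings or s.produces_embeddings
    return True


print(validate(pipeline))  # True
```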
Theoretical Basis
CLIP (Contrastive Language-Image Pre-training) learns joint vision-language representations through contrastive learning on large-scale image-text pairs. The model consists of two encoders: an image encoder (Vision Transformer, ViT) and a text encoder (Transformer). During training, the model learns to maximize the cosine similarity between matching image-text pairs while minimizing the similarity between non-matching pairs. The image encoder maps raw pixels to fixed-dimensional vectors in a shared embedding space where semantically similar images cluster together. The ViT-L/14 variant uses a Large Vision Transformer operating on 14×14 pixel patches, producing high-quality embeddings that capture both low-level visual features and high-level semantic content. These embeddings serve as a universal representation for downstream tasks, including zero-shot classification and similarity search, and as input features for lightweight classifiers such as aesthetic quality predictors and NSFW detectors.
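The training objective described above can be written as a symmetric cross-entropy over a batch's image-text similarity matrix (the InfoNCE-style loss from the CLIP paper). The NumPy sketch below is a simplified illustration of that objective, not OpenAI's implementation; the temperature value and batch setup are arbitrary choices for demonstration. Matching pairs sit on the diagonal, so aligned embeddings should yield a much lower loss than random pairings.

```python
import numpy as np


def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of image/text embeddings.
    Row i of `img` matches row i of `txt`; all other rows are negatives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities

    def xent(l):
        # Cross-entropy with the correct class on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2


rng = np.random.default_rng(0)
txt = rng.standard_normal((8, 768))
# Images nearly aligned with their captions vs. unrelated random pairs.
aligned = clip_contrastive_loss(txt + 0.01 * rng.standard_normal((8, 768)), txt)
random_pairs = clip_contrastive_loss(rng.standard_normal((8, 768)), txt)
print(aligned < random_pairs)  # True: matched pairs give a much lower loss
```

Minimizing this loss is what pulls matching image-text pairs together and pushes non-matching pairs apart in the shared embedding space.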