Principle: NVIDIA NeMo Curator CLIP Embedding
| Metadata | |
|---|---|
| Knowledge Sources | Paper: Learning Transferable Visual Models From Natural Language Supervision (CLIP) |
| Domains | Data_Curation, Image_Processing, Representation_Learning |
| Last Updated | 2026-02-14 |
Overview
CLIP Embedding is a technique for computing dense vector representations of images using the CLIP ViT-L/14 model, enabling similarity search and downstream filtering in image curation pipelines.
Description
CLIP Embedding in NeMo Curator uses OpenAI's CLIP (Contrastive Language-Image Pre-training) model to project images into a shared vision-language embedding space. Specifically, the ViT-L/14 variant of the CLIP image encoder maps raw pixel data into fixed-dimensional dense vectors. These embeddings capture the high-level semantic content of images and can be used for a variety of downstream tasks, including similarity search, aesthetic quality scoring, NSFW content detection, and semantic deduplication. The embedding computation is performed on GPU for efficiency, and the resulting vectors are stored alongside the image data in ImageBatch objects for consumption by subsequent pipeline stages.
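The flow described above can be sketched as follows. This is a minimal, self-contained illustration, not NeMo Curator's actual implementation: `ImageBatch` here is a hypothetical stand-in for the curator's container of the same name, and `encode_images` is a placeholder for the real ViT-L/14 GPU forward pass (a fixed random projection keeps the sketch runnable). The 768-dimensional output matches CLIP ViT-L/14's projected embedding size.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

EMBED_DIM = 768  # CLIP ViT-L/14 projects images to 768-dimensional vectors


@dataclass
class ImageBatch:
    """Minimal stand-in for NeMo Curator's ImageBatch container."""
    images: np.ndarray                       # preprocessed pixel data, (N, C, H, W)
    embeddings: Optional[np.ndarray] = None  # filled in by the embedding stage


def encode_images(pixels: np.ndarray) -> np.ndarray:
    """Placeholder for the CLIP ViT-L/14 image encoder (in the real pipeline
    this is a GPU forward pass); a fixed random projection keeps it runnable."""
    flat = pixels.reshape(len(pixels), -1)
    rng = np.random.default_rng(0)
    w = rng.standard_normal((flat.shape[1], EMBED_DIM))
    return flat @ w


def embedding_stage(batch: ImageBatch) -> ImageBatch:
    """Compute embeddings and L2-normalize them so cosine similarity
    between two images reduces to a dot product of their vectors."""
    raw = encode_images(batch.images)
    batch.embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)
    return batch


# Tiny 8x8 stand-in images (real CLIP inputs are 224x224 RGB crops).
batch = embedding_stage(ImageBatch(images=np.random.default_rng(1).random((4, 3, 8, 8))))
print(batch.embeddings.shape)  # (4, 768)
```

Normalizing at this stage is a common choice because every downstream consumer (similarity search, deduplication, lightweight classifiers) can then treat dot products as cosine similarities.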
Usage
Use CLIP Embedding after the Image Ingestion stage and before any filtering or deduplication stages that require embedding vectors as input. It is required whenever downstream stages such as aesthetic filtering, NSFW filtering, or semantic deduplication consume CLIP embeddings rather than raw pixel data.
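The ordering constraint can be made explicit with a small validation sketch. The stage names and the `Stage` class below are hypothetical illustrations of the dependency, not NeMo Curator API: the check simply enforces that every embedding-consuming stage appears after the embedding-producing one.

```python
class Stage:
    """Hypothetical pipeline stage with embedding produce/consume flags."""
    def __init__(self, name, requires_embeddings=False, produces_embeddings=False):
        self.name = name
        self.requires_embeddings = requires_embeddings
        self.produces_embeddings = produces_embeddings


# Illustrative ordering: ingest -> embed -> embedding-dependent stages.
pipeline = [
    Stage("image_ingestion"),
    Stage("clip_embedding", produces_embeddings=True),
    Stage("aesthetic_filter", requires_embeddings=True),
    Stage("nsfw_filter", requires_embeddings=True),
    Stage("semantic_dedup", requires_embeddings=True),
]


def validate(stages):
    """Raise if a stage that consumes embeddings runs before they are produced."""
    have_embeddings = False
    for s in stages:
        if s.requires_embeddings and not have_embeddings:
            raise ValueError(f"{s.name} placed before the embedding stage")
        have_embeddings = have_embeddings or s.produces_embeddings
    return True


print(validate(pipeline))  # True
```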
Theoretical Basis
CLIP (Contrastive Language-Image Pre-training) learns joint vision-language representations through contrastive learning on large-scale image-text pairs. The model consists of two encoders: an image encoder (Vision Transformer, ViT) and a text encoder (Transformer). During training, the model learns to maximize the cosine similarity between matching image-text pairs while minimizing the similarity between non-matching pairs. The image encoder maps raw pixels to fixed-dimensional vectors in a shared embedding space where semantically similar images cluster together. The ViT-L/14 variant uses a Large Vision Transformer operating on 14×14 pixel patches, producing high-quality embeddings that capture both low-level visual features and high-level semantic content. These embeddings serve as a universal representation for downstream tasks, including zero-shot classification and similarity search, and as input features for lightweight classifiers such as aesthetic quality predictors and NSFW detectors.
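The training objective described above can be written as a symmetric cross-entropy over a batch's image-text similarity matrix (the InfoNCE-style loss from the CLIP paper). The NumPy sketch below is a simplified illustration of that objective, not OpenAI's implementation; the temperature value and batch setup are arbitrary choices for demonstration. Matching pairs sit on the diagonal, so aligned embeddings should yield a much lower loss than random pairings.

```python
import numpy as np


def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of image/text embeddings.
    Row i of `img` matches row i of `txt`; all other rows are negatives."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities

    def xent(l):
        # Cross-entropy with the correct class on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2


rng = np.random.default_rng(0)
txt = rng.standard_normal((8, 768))
# Images nearly aligned with their captions vs. unrelated random pairs.
aligned = clip_contrastive_loss(txt + 0.01 * rng.standard_normal((8, 768)), txt)
random_pairs = clip_contrastive_loss(rng.standard_normal((8, 768)), txt)
print(aligned < random_pairs)  # True: matched pairs give a much lower loss
```

Minimizing this loss is what pulls matching image-text pairs together and pushes non-matching pairs apart in the shared embedding space.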