Principle:NVIDIA NeMo Curator Semantic Deduplication for Video
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Video_Processing, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A technique for identifying semantically similar video clips by clustering their embeddings and computing pairwise similarity within each cluster.
Description
Semantic Deduplication for Video uses the same underlying algorithm as text semantic deduplication but operates on video clip embeddings (from Cosmos-Embed1). The workflow clusters embeddings using KMeans, computes pairwise cosine similarity within clusters, and identifies duplicates exceeding a similarity threshold.
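As a reading aid, the similarity test described above can be written out as follows. The symbols $u$, $v$ (the embeddings of two clips in the same cluster) and $\varepsilon$ (the epsilon parameter) are notation chosen here for illustration, not names taken from the library.

$$
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
(u, v) \text{ is a duplicate pair} \iff \operatorname{sim}(u, v) > 1 - \varepsilon
$$

A smaller $\varepsilon$ therefore means a stricter test: with $\varepsilon = 0.01$, two clips count as duplicates only if their cosine similarity is above $0.99$.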
Usage
Apply this step after embedding computation to remove semantically redundant video clips. The exported Parquet files must already contain a pre-computed embedding column.
Theoretical Basis
- Cluster embeddings into k groups using GPU-accelerated KMeans
- Within each cluster, compute pairwise cosine similarity
- Mark pairs whose cosine similarity exceeds the threshold (1.0 - epsilon) as duplicates
- Rank the duplicates and select which one to keep (hardest, easiest, or random)
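The steps above can be sketched in a few lines of plain Python. This is a minimal CPU illustration under stated assumptions, not the NeMo Curator implementation: `kmeans`, `cosine`, and `semantic_dedup` are hypothetical helper names, a simple Lloyd's k-means stands in for GPU-accelerated KMeans, and the ranking rule (keep the member most similar to its cluster centroid) is one illustrative choice rather than a definition of the hardest/easiest/random options.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means (CPU stand-in for GPU KMeans).

    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest centroid (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def semantic_dedup(embeddings, k, eps, seed=0):
    """Return sorted indices of embeddings flagged as duplicates."""
    labels, centroids = kmeans(embeddings, k, seed=seed)
    threshold = 1.0 - eps
    to_remove = set()
    for c in range(k):
        ids = [i for i, lab in enumerate(labels) if lab == c]
        # Illustrative ranking (an assumption): most centroid-like member first.
        ids.sort(key=lambda i: cosine(embeddings[i], centroids[c]), reverse=True)
        # Flag an item if it is too similar to any higher-ranked cluster member.
        for pos, i in enumerate(ids):
            if any(cosine(embeddings[i], embeddings[j]) > threshold
                   for j in ids[:pos]):
                to_remove.add(i)
    return to_remove and sorted(to_remove) or []
```

Calling `semantic_dedup(embeddings, k=2, eps=0.01)` on a list of equal-length float vectors returns the indices to drop. Restricting comparisons to members of the same cluster is what keeps the pairwise step tractable: the cost grows with the square of the cluster size rather than the square of the whole dataset.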
Related Pages
Implemented By