Principle:NVIDIA NeMo Curator Semantic Deduplication for Video

From Leeroopedia
Knowledge Sources
Domains: Data_Curation, Video_Processing, Deduplication
Last Updated: 2026-02-14 17:00 GMT

Overview

A technique for identifying semantically similar video clips by clustering clip embeddings and computing pairwise similarity within each cluster.

Description

Semantic Deduplication for Video uses the same underlying algorithm as text semantic deduplication, but it operates on video clip embeddings (from Cosmos-Embed1) instead of text embeddings. The workflow clusters the embeddings with KMeans, computes pairwise cosine similarity within each cluster, and flags as duplicates the pairs whose similarity exceeds a threshold.
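
The duplicate criterion can be written out explicitly. The notation below is an assumption made for illustration: e_i and e_j stand for the embeddings of two clips in the same cluster, and epsilon is the tolerance that appears in the threshold (1.0 - epsilon) under Theoretical Basis.

```latex
\operatorname{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert},
\qquad
(i, j) \text{ is a duplicate pair} \iff \operatorname{sim}(e_i, e_j) > 1 - \varepsilon
```

As an illustrative value only (not a documented default), epsilon = 0.01 flags pairs whose cosine similarity is above 0.99. A smaller epsilon makes the check stricter and removes fewer clips.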

Usage

Use this step after embedding computation to remove semantically redundant video clips. It requires the exported Parquet files to contain a pre-computed embedding column.

Theoretical Basis

  1. Cluster the embeddings into k groups using GPU-accelerated KMeans
  2. Within each cluster, compute the pairwise cosine similarity between embeddings
  3. Mark pairs whose similarity exceeds the threshold (1.0 - epsilon) as duplicates
  4. Rank the clips and select which member of each duplicate set to keep (hardest/easiest/random)
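
The four steps can be sketched in plain Python. This is a minimal CPU illustration under stated assumptions, not the NeMo Curator implementation: the function names (`semantic_dedup`, `kmeans`) and the `which_to_keep` values are hypothetical, a small pure-Python KMeans stands in for the GPU-accelerated one, and "hard"/"easy" are assumed to mean keeping the clip farthest from or closest to its cluster centroid.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's KMeans (CPU stand-in); returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def semantic_dedup(embeddings, k, eps, which_to_keep="hard", seed=0):
    """Return the sorted indices of clips to remove as semantic duplicates."""
    points = [normalize(e) for e in embeddings]
    labels, centroids = kmeans(points, k, seed=seed)  # step 1
    rng = random.Random(seed)
    removed = set()
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Step 4 (ranking): order members so the preferred clip comes first.
        if which_to_keep == "random":
            rng.shuffle(members)
        else:
            members.sort(
                key=lambda i: sq_dist(points[i], centroids[c]),
                reverse=(which_to_keep == "hard"),  # assumed: hard = farthest first
            )
        # Steps 2-3: compare each clip with every higher-ranked clip in the
        # same cluster and remove it if any similarity exceeds 1 - eps.
        for pos, i in enumerate(members):
            if any(dot(points[i], points[j]) > 1.0 - eps for j in members[:pos]):
                removed.add(i)
    return sorted(removed)


# Toy 2-D "embeddings": clips 0/1 and 2/3 are near-identical, clip 4 is distinct.
clips = [[1.0, 0.0], [0.999, 0.02], [0.0, 1.0], [0.01, 1.0], [-1.0, 0.0]]
to_remove = semantic_dedup(clips, k=2, eps=0.01)
kept = [i for i in range(len(clips)) if i not in to_remove]
```

On this toy input, one clip from each near-identical pair is removed and the distinct clip is kept. Restricting comparisons to clips in the same cluster is what keeps the pairwise step tractable: the cost scales with the sum of squared cluster sizes rather than with the square of the whole dataset.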

Related Pages

Implemented By
