Implementation:NVIDIA NeMo Curator SemanticDeduplicationWorkflow for Video

From Leeroopedia
Knowledge Sources
Domains Data_Curation, Video_Processing, Deduplication
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete workflow, provided by NeMo Curator, for embedding-based semantic deduplication of video clips.

Description

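The SemanticDeduplicationWorkflow orchestrates three steps on video clip embeddings: KMeans clustering, pairwise similarity computation, and duplicate identification. It uses RAPIDS cuML for GPU-accelerated KMeans. The distance_metric parameter ("cosine" or "l2") selects the distance metric, and the which_to_keep parameter ("hard", "easy", or "random") selects the ranking strategy that decides which member of a duplicate group is kept.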
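To make the duplicate-identification step concrete, here is a minimal pure-Python sketch of what happens inside a single cluster. It is not NeMo Curator's implementation. The function names are hypothetical, and two conventions are assumptions borrowed from the SemDeDup approach rather than confirmed by this page: an item counts as a duplicate when its cosine similarity to a higher-ranked item exceeds 1 - eps, and "hard" ranks items farthest from the centroid first while "easy" ranks the closest first.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates_in_cluster(ids, embeddings, eps, which_to_keep="hard", seed=0):
    """Return the IDs flagged as duplicates within one cluster (illustrative only).

    Assumption: an item is a duplicate when its cosine similarity to any
    higher-ranked item in the same cluster exceeds 1 - eps.
    """
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    # Cosine distance to the cluster centroid drives the ranking.
    dist = [1.0 - cosine_similarity(e, centroid) for e in embeddings]

    order = list(range(len(ids)))
    if which_to_keep == "hard":
        order.sort(key=lambda i: dist[i], reverse=True)  # farthest from centroid first
    elif which_to_keep == "easy":
        order.sort(key=lambda i: dist[i])  # closest to centroid first
    elif which_to_keep == "random":
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")

    duplicates = []
    for rank, i in enumerate(order):
        # Compare only against higher-ranked items, so the top-ranked member
        # of each near-duplicate group is never flagged.
        max_sim = max(
            (cosine_similarity(embeddings[i], embeddings[j]) for j in order[:rank]),
            default=float("-inf"),
        )
        if max_sim > 1.0 - eps:
            duplicates.append(ids[i])
    return duplicates


ids = ["a", "b", "c"]
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(find_duplicates_in_cluster(ids, embeddings, eps=0.1, which_to_keep="hard"))  # ['b']
print(find_duplicates_in_cluster(ids, embeddings, eps=0.1, which_to_keep="easy"))  # ['a']
```

In this toy cluster, "a" and "b" are near-identical. Which of the two is flagged depends only on the ranking strategy, while "c" is kept in both cases.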

Usage

Use this workflow after exporting video embeddings to parquet files. Set n_clusters according to dataset size, and set eps to the similarity threshold used to identify duplicates.
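Because the workflow expects embeddings to already exist as parquet files, a quick preflight check can catch an empty or mistyped input directory before a GPU job starts. The helper below is a hypothetical sketch that uses only the standard library. It is not part of NeMo Curator, and how the workflow itself resolves input_path is not stated on this page.

```python
from pathlib import Path


def list_parquet_files(input_path):
    """Collect .parquet files from one path or a list of paths (hypothetical helper)."""
    paths = [input_path] if isinstance(input_path, str) else list(input_path)
    found = []
    for p in map(Path, paths):
        if p.is_dir():
            # Search the directory recursively for parquet files.
            found.extend(sorted(p.rglob("*.parquet")))
        elif p.is_file() and p.suffix == ".parquet":
            found.append(p)
    return found


files = list_parquet_files("./data/video_embeddings")
if not files:
    print("No parquet files found; export embeddings before running the workflow.")
```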

Code Reference

Source Location

  • Repository: NeMo Curator
  • File: nemo_curator/stages/deduplication/semantic/workflow.py
  • Lines: L48-420

Signature

class SemanticDeduplicationWorkflow(WorkflowBase):
    def __init__(
        self,
        input_path: str | list[str],
        output_path: str,
        n_clusters: int,
        id_field: str = "id",
        embedding_field: str = "embeddings",
        distance_metric: Literal["cosine", "l2"] = "cosine",
        which_to_keep: Literal["hard", "easy", "random"] = "hard",
        eps: float | None = None,
        ...
    ):

Import

from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

I/O Contract

Inputs

Name         Type             Required  Description
input_path   str | list[str]  Yes       Path (or list of paths) to embedding parquet files
output_path  str              Yes       Directory where results are written
n_clusters   int              Yes       Number of KMeans clusters

Outputs

Name           Type               Description
result         WorkflowRunResult  Execution metadata
duplicate_ids  Parquet files      Written to output_path/duplicates/

Usage Examples

from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

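workflow = SemanticDeduplicationWorkflow(
    input_path="./data/video_embeddings",       # directory of embedding parquet files
    output_path="./output/video_dedup",         # duplicate IDs are written under duplicates/
    n_clusters=1000,                            # number of KMeans clusters
    embedding_field="cosmos_embed1_embedding",  # column that holds the embeddings
    distance_metric="cosine",
    eps=0.1,                                    # similarity threshold for identifying duplicates
)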

result = workflow.run()
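The workflow writes duplicate IDs rather than a cleaned dataset, so a follow-up step is needed to drop those IDs from the original records. The sketch below shows that filtering step in plain Python. The function name is hypothetical, and loading the parquet files under output_path/duplicates/ into a list of IDs is left to whichever parquet reader is in use. The "id" default mirrors the id_field default in the signature.

```python
def drop_duplicates(records, duplicate_ids, id_field="id"):
    """Return the records whose ID is not in duplicate_ids (hypothetical helper)."""
    duplicate_set = set(duplicate_ids)  # set lookup keeps the filter linear in len(records)
    return [r for r in records if r[id_field] not in duplicate_set]


clips = [
    {"id": "clip-001", "path": "a.mp4"},
    {"id": "clip-002", "path": "b.mp4"},
    {"id": "clip-003", "path": "c.mp4"},
]
kept = drop_duplicates(clips, duplicate_ids=["clip-002"])
print([c["id"] for c in kept])  # ['clip-001', 'clip-003']
```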

Related Pages

Implements Principle
