Implementation:NVIDIA NeMo Curator SemanticDeduplicationWorkflow for Video
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Video_Processing, Deduplication |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete workflow, provided by NeMo Curator, for embedding-based semantic deduplication of video clips.
Description
The SemanticDeduplicationWorkflow orchestrates KMeans clustering, pairwise similarity computation, and duplicate identification on video clip embeddings. It uses RAPIDS cuML for GPU-accelerated KMeans and configurable ranking strategies for deciding which duplicates to keep.
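The three stages can be illustrated with a minimal, CPU-only sketch. This is not the NeMo Curator implementation: `cosine_similarity` and `find_duplicates` are hypothetical helpers, cluster labels are passed in directly rather than computed with KMeans, items are ranked by input order rather than by a `which_to_keep` strategy, and the rule "similarity >= 1 - eps marks a duplicate" is an assumption about how `eps` is interpreted.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_duplicates(ids, embeddings, labels, eps):
    """Return IDs flagged as duplicates within each cluster.

    Assumption (illustrative only): an item is a duplicate when its cosine
    similarity to any earlier item in the same cluster is >= 1 - eps.
    """
    # Group items by cluster label; only items that share a cluster are compared.
    clusters = defaultdict(list)
    for item_id, vector, label in zip(ids, embeddings, labels):
        clusters[label].append((item_id, vector))

    duplicates = []
    for label in sorted(clusters):
        members = clusters[label]
        for j, (item_id, vector) in enumerate(members):
            # Pairwise comparison against every earlier member of the cluster.
            if any(
                cosine_similarity(vector, other) >= 1.0 - eps
                for _, other in members[:j]
            ):
                duplicates.append(item_id)
    return duplicates


ids = ["a", "b", "c", "d"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.02]]
labels = [0, 0, 1, 0]  # stand-in for KMeans cluster assignments
print(find_duplicates(ids, embeddings, labels, eps=0.1))
```

Because comparisons happen only inside a cluster, near-identical clips assigned to different clusters are never compared. This is the trade-off that clustering makes to avoid computing similarities over the entire dataset.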
Usage
Use this workflow after exporting video embeddings to Parquet files. Set n_clusters according to dataset size, and set eps to the similarity threshold used to identify duplicates.
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/deduplication/semantic/workflow.py
- Lines: L48-420
Signature
class SemanticDeduplicationWorkflow(WorkflowBase):
    def __init__(
        self,
        input_path: str | list[str],
        output_path: str,
        n_clusters: int,
        id_field: str = "id",
        embedding_field: str = "embeddings",
        distance_metric: Literal["cosine", "l2"] = "cosine",
        which_to_keep: Literal["hard", "easy", "random"] = "hard",
        eps: float | None = None,
        ...
    ):
Import
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow
I/O Contract
Inputs
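| Name | Type | Required | Description |
|---|---|---|---|
| input_path | str \| list[str] | Yes | Path or paths to embedding Parquet files |
| output_path | str | Yes | Directory where results are written |
| n_clusters | int | Yes | Number of KMeans clusters |
| id_field | str | No (default "id") | Name of the ID column |
| embedding_field | str | No (default "embeddings") | Name of the embedding column |
| distance_metric | "cosine" or "l2" | No (default "cosine") | Metric for pairwise similarity |
| which_to_keep | "hard", "easy", or "random" | No (default "hard") | Ranking strategy for deciding which duplicates to keep |
| eps | float \| None | No (default None) | Similarity threshold for identifying duplicates |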
Outputs
| Name | Type | Description |
|---|---|---|
| result | WorkflowRunResult | Execution metadata |
| duplicate_ids | Parquet files | Written to output_path/duplicates/ |
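As a small sketch of locating the duplicate ID files after a run, the helper below lists whatever Parquet files sit under output_path/duplicates/. The function name `list_duplicate_files` and the `.parquet` file extension are assumptions for illustration. The column layout inside the files is not specified here, so the sketch stops at listing paths.

```python
from pathlib import Path


def list_duplicate_files(output_path):
    """Return sorted paths of Parquet files under <output_path>/duplicates/.

    Assumption: the files carry a ".parquet" extension.
    """
    duplicates_dir = Path(output_path) / "duplicates"
    if not duplicates_dir.is_dir():
        # Nothing written yet (or the workflow has not been run).
        return []
    return sorted(duplicates_dir.glob("*.parquet"))
```

The returned paths can then be handed to any Parquet reader to load the duplicate IDs.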
Usage Examples
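from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

workflow = SemanticDeduplicationWorkflow(
    input_path="./data/video_embeddings",  # Parquet files with clip embeddings
    output_path="./output/video_dedup",  # duplicate IDs are written under duplicates/
    n_clusters=1000,  # number of KMeans clusters
    embedding_field="cosmos_embed1_embedding",  # column holding the embedding vectors
    distance_metric="cosine",
    eps=0.1,  # similarity threshold for identifying duplicates
)

# Returns a WorkflowRunResult with execution metadata.
result = workflow.run()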
Related Pages
Implements Principle