Principle:Datajuicer Data juicer Data Deduplication

From Leeroopedia
Domains Data_Processing, Data_Quality
Last Updated 2026-02-14 17:00 GMT

Overview

A two-phase operator pattern for identifying and removing duplicate data samples across text, image, and video modalities, using hash-based fingerprinting and set/graph-based clustering.

Pattern

Deduplicator operators extend the Deduplicator base class and implement a consistent two-phase approach:

1. Hash Computation (compute_hash) -- Each sample is fingerprinted using a modality-appropriate hashing strategy: MD5 for exact text/video matching, MinHash with LSH for near-duplicate text detection, SimHash with Hamming distance for text variation detection, or perceptual hashing (phash/dhash/whash/ahash) for visual similarity in images.

2. Deduplication Processing (process) -- The process method operates at the dataset level (not per-sample), building hash tables or similarity graphs, clustering duplicates using Union-Find or BFS, and retaining only the first occurrence from each cluster. This is a global operation that requires access to all samples simultaneously.
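The two phases above can be sketched for the simplest case, exact text matching with MD5. This is a minimal illustration under stated assumptions, not the Data-Juicer implementation: the function names mirror the phases described here, but the sample layout (a dict with a "text" field), the "hash" field, and the deduplicate helper are hypothetical.

```python
import hashlib


def compute_hash(sample: dict, text_key: str = "text") -> dict:
    """Phase 1 (per-sample): attach an MD5 fingerprint of the text."""
    digest = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    # "hash" is a hypothetical field name chosen for this sketch.
    return {**sample, "hash": digest}


def process(samples: list) -> list:
    """Phase 2 (global): keep the first sample seen for each fingerprint."""
    seen = set()
    kept = []
    for sample in samples:
        if sample["hash"] not in seen:
            seen.add(sample["hash"])
            kept.append(sample)
    return kept


def deduplicate(samples: list) -> list:
    """Run both phases: hash every sample, then filter over the whole set."""
    hashed = [compute_hash(s) for s in samples]
    return process(hashed)
```

Phase 1 touches one sample at a time, while phase 2 needs the set of all fingerprints seen so far, which is why it runs over the whole dataset.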

All deduplicators are registered via @OPERATORS.register_module() and configured through YAML. They optionally support cross-modal deduplication (e.g., combining text and image hashes) and diagnostic output of duplicate pairs via the show_num parameter.
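As a rough illustration of cross-modal keys and diagnostic pairs, the sketch below groups samples by a tuple of per-modality hashes and returns up to show_num example duplicate pairs. Only the show_num name comes from the text above; the function, the "text_hash" and "image_hash" fields, and the return shape are assumptions made for this example.

```python
from collections import defaultdict


def dedup_with_pairs(samples, hash_keys=("text_hash", "image_hash"), show_num=0):
    """Group by a composite hash key, keep the first sample per group,
    and collect up to `show_num` duplicate pairs for inspection."""
    groups = defaultdict(list)
    for sample in samples:
        # Composite key: samples match only if every modality hash matches.
        key = tuple(sample[k] for k in hash_keys)
        groups[key].append(sample)

    # Dicts preserve insertion order, so group[0] is the first occurrence.
    kept = [group[0] for group in groups.values()]

    dup_pairs = []
    for group in groups.values():
        if len(dup_pairs) >= show_num:
            break
        if len(group) > 1:
            dup_pairs.append((group[0], group[1]))
    return kept, dup_pairs
```

With a composite key, two samples that share a text hash but differ in image hash are treated as distinct, so adding modalities makes the match stricter.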

Key Characteristics

  • Two-phase architecture: compute_hash (per-sample) then process (global dataset operation)
  • Global deduplication phase: process requires access to the full dataset, so unlike compute_hash it cannot run per-sample
  • Modality-specific hashing strategies (MD5, MinHash+LSH, SimHash, perceptual hashing)
  • Optional cross-modal composite keys (e.g., text + image hash tuples)
  • Configurable similarity thresholds for near-duplicate detection
  • Cluster-based retention: keeps first occurrence per duplicate cluster
  • Diagnostic output: optional duplicate pair sampling for traceability
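The threshold and cluster-retention characteristics can be sketched together with a small Union-Find over integer fingerprints (for example, SimHash values) compared by Hamming distance. This is an illustrative sketch only: the function names and the max_distance parameter are hypothetical, and the all-pairs comparison is used for brevity where a real system would typically narrow candidates first (for instance with LSH).

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def cluster_and_keep_first(fingerprints, max_distance=3):
    """Return indices of the samples to keep: the first of each cluster."""
    n = len(fingerprints)
    parent = list(range(n))

    def find(i):
        # Path halving keeps the trees shallow without changing roots.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Attach the larger root under the smaller one, so each
            # cluster's root is always its first occurrence.
            parent[max(ri, rj)] = min(ri, rj)

    # Naive O(n^2) pairwise comparison, fine for a small illustration.
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(fingerprints[i], fingerprints[j]) <= max_distance:
                union(i, j)

    return [i for i in range(n) if find(i) == i]
```

Clustering is transitive: two samples whose fingerprints are further apart than the threshold still land in the same cluster if a chain of near-duplicates links them, and only the earliest sample in that cluster is retained.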

Implementations

Related Pages
