Principle:Datajuicer Data juicer Data Deduplication
| Domains | Data_Processing, Data_Quality |
|---|---|
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A two-phase operator pattern for identifying and removing duplicate data samples across text, image, and video modalities, using hash-based fingerprinting and set/graph-based clustering.
Pattern
Deduplicator operators extend the Deduplicator base class and implement a consistent two-phase approach:
1. Hash Computation (compute_hash) -- Each sample is fingerprinted using a modality-appropriate hashing strategy: MD5 for exact text/video matching, MinHash with LSH for near-duplicate text detection, SimHash compared by Hamming distance for text that differs only by small variations, or perceptual hashing (phash/dhash/whash/ahash) for visual similarity in images.
2. Deduplication Processing (process) -- The process method operates at the dataset level (not per-sample), building hash tables or similarity graphs, clustering duplicates using Union-Find or BFS, and retaining only the first occurrence from each cluster. This is a global operation that requires access to all samples simultaneously.
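The two phases above can be illustrated with a minimal, self-contained sketch of exact-match deduplication. It is not the Data-Juicer implementation: the free functions, the plain list-of-dicts dataset, and the "text" and "hash" field names are assumptions made for illustration only.

```python
import hashlib


def compute_hash(sample, text_key="text"):
    """Phase 1 (per-sample): attach an MD5 fingerprint of the text field.

    The "hash" field name is a placeholder chosen for this sketch.
    """
    digest = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    return {**sample, "hash": digest}


def process(samples):
    """Phase 2 (global): keep the first sample seen for each fingerprint."""
    seen = set()
    kept = []
    for sample in samples:
        if sample["hash"] not in seen:
            seen.add(sample["hash"])
            kept.append(sample)
    return kept


dataset = [
    {"text": "hello world"},
    {"text": "another doc"},
    {"text": "hello world"},  # exact duplicate of the first sample
]

# Phase 1 touches one sample at a time, so it could be mapped in parallel.
hashed = [compute_hash(s) for s in dataset]
# Phase 2 needs every fingerprint at once to decide what to keep.
deduped = process(hashed)
print([s["text"] for s in deduped])  # ['hello world', 'another doc']
```

Note that only the second phase needs the whole dataset; the first phase has no dependency between samples.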
All deduplicators are registered via @OPERATORS.register_module() and configured through YAML. They optionally support cross-modal deduplication (e.g., combining text and image hashes) and diagnostic output of duplicate pairs via the show_num parameter.
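As a rough sketch of how cross-modal keys and diagnostic output might fit together, the example below deduplicates on a (text hash, image hash) tuple and records a bounded number of duplicate pairs. The function name dedup_composite, the "text_hash" and "image_hash" fields, and the choice to record the first show_num pairs encountered (rather than sampling them) are all assumptions for illustration, not the library's behavior.

```python
def dedup_composite(samples, show_num=0):
    """Keep the first sample per (text_hash, image_hash) key.

    Returns the kept samples plus up to `show_num` (original, duplicate)
    pairs for inspection. Field names are hypothetical.
    """
    first_seen = {}
    kept = []
    dup_pairs = []
    for sample in samples:
        # Two samples count as duplicates only if BOTH hashes match.
        key = (sample["text_hash"], sample["image_hash"])
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            dup_pairs.append((first_seen[key], sample))
    return kept, dup_pairs


samples = [
    {"id": 0, "text_hash": "t1", "image_hash": "i1"},
    {"id": 1, "text_hash": "t1", "image_hash": "i2"},  # same text, new image
    {"id": 2, "text_hash": "t1", "image_hash": "i1"},  # duplicate of id 0
    {"id": 3, "text_hash": "t1", "image_hash": "i1"},  # duplicate of id 0
]

kept, dup_pairs = dedup_composite(samples, show_num=1)
print([s["id"] for s in kept])                     # [0, 1]
print([(a["id"], b["id"]) for a, b in dup_pairs])  # [(0, 2)]
```

With a text-only key, sample 1 would have been dropped as well; the composite key keeps it because its image differs.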
Key Characteristics
- Two-phase architecture: compute_hash (per-sample) then process (global dataset operation)
- Global process phase requiring full dataset access (unlike compute_hash, it cannot be parallelized per-sample)
- Modality-specific hashing strategies (MD5, MinHash+LSH, SimHash, perceptual hashing)
- Optional cross-modal composite keys (e.g., text + image hash tuples)
- Configurable similarity thresholds for near-duplicate detection
- Cluster-based retention: keeps first occurrence per duplicate cluster
- Diagnostic output: optional duplicate pair sampling for traceability
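The threshold and cluster-retention characteristics can be sketched with a small Union-Find over integer fingerprints compared by Hamming distance. This is an illustrative simplification, not the library's code: the class and function names are hypothetical, the fingerprints are made-up 8-bit values, and the all-pairs comparison stands in for the bucketing (such as LSH) that a real near-duplicate detector would use to avoid quadratic cost.

```python
class UnionFind:
    """Disjoint-set structure whose root is always the smallest index."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Attach the later root under the earlier one, so each
            # cluster's root is its first occurrence in the dataset.
            lo, hi = min(ra, rb), max(ra, rb)
            self.parent[hi] = lo


def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def cluster_keep_first(fingerprints, threshold):
    """Return indices of the first sample in each near-duplicate cluster."""
    uf = UnionFind(len(fingerprints))
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if hamming(fingerprints[i], fingerprints[j]) <= threshold:
                uf.union(i, j)
    return [i for i in range(len(fingerprints)) if uf.find(i) == i]


fingerprints = [0b10110000, 0b10110001, 0b01001111, 0b10110011, 0b01001110]
print(cluster_keep_first(fingerprints, threshold=1))  # [0, 2]
```

Sample 3 is two bits away from sample 0, which exceeds the threshold, yet it still lands in the same cluster because it is one bit away from sample 1. Clustering is transitive, which is why a graph or Union-Find structure is used rather than a direct comparison against the retained sample.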
Implementations
- Implementation:Datajuicer_Data_juicer_DocumentDeduplicator
- Implementation:Datajuicer_Data_juicer_DocumentMinhashDeduplicator
- Implementation:Datajuicer_Data_juicer_DocumentSimhashDeduplicator
- Implementation:Datajuicer_Data_juicer_ImageDeduplicator
- Implementation:Datajuicer_Data_juicer_VideoDeduplicator