Principle:Datajuicer Data juicer Data Deduplication

Domains	Data_Processing, Data_Quality
Last Updated	2026-02-14 17:00 GMT

Overview

A two-phase operator pattern for identifying and removing duplicate data samples across text, image, and video modalities, using hash-based fingerprinting and set/graph-based clustering.

Pattern

Deduplicator operators extend the Deduplicator base class and implement a consistent two-phase approach:

1. Hash Computation (compute_hash) -- Each sample is fingerprinted using a modality-appropriate hashing strategy: MD5 for exact text/video matching, MinHash with LSH for near-duplicate text detection, SimHash with Hamming distance for text variation detection, or perceptual hashing (phash/dhash/whash/ahash) for visual similarity in images.

2. Deduplication Processing (process) -- The process method operates at the dataset level (not per-sample), building hash tables or similarity graphs, clustering duplicates using Union-Find or BFS, and retaining only the first occurrence from each cluster. This is a global operation that requires access to all samples simultaneously.
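
The two phases can be sketched for the exact-match (MD5) case as follows. This is a minimal illustration using only the standard library, not the Data-Juicer implementation: the free functions, the "text" field, and the "hash" field are assumptions made for the example, and a plain Python list stands in for the dataset.

```python
import hashlib


def compute_hash(sample, text_key="text"):
    """Phase 1 (per-sample): attach an MD5 fingerprint of the sample text.

    The "hash" field name is an assumption made for this sketch.
    """
    digest = hashlib.md5(sample[text_key].encode("utf-8")).hexdigest()
    return {**sample, "hash": digest}


def process(dataset):
    """Phase 2 (global): keep only the first sample seen for each hash."""
    seen = set()
    kept = []
    for sample in dataset:
        if sample["hash"] not in seen:
            seen.add(sample["hash"])
            kept.append(sample)
    return kept


raw = [{"text": "hello world"}, {"text": "foo"}, {"text": "hello world"}]
hashed = [compute_hash(s) for s in raw]  # independent per sample
deduped = process(hashed)                # needs the whole dataset at once
print([s["text"] for s in deduped])      # ['hello world', 'foo']
```

Phase 1 touches one sample at a time, so it can be mapped over the dataset independently; phase 2 depends on everything seen so far, which is why it operates on the dataset as a whole.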

All deduplicators are registered via @OPERATORS.register_module() and configured through YAML. They optionally support cross-modal deduplication (e.g., combining text and image hashes) and diagnostic output of duplicate pairs via the show_num parameter.
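
As a rough sketch of cross-modal keys and duplicate-pair sampling, the example below deduplicates on a (text hash, image hash) tuple and collects a bounded number of duplicate pairs. The function name, the "image_bytes" field, and the exact meaning given to show_num here are assumptions for illustration; MD5 over raw bytes stands in for the perceptual hash a real image deduplicator would use.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def dedup_with_pairs(samples, show_num=0):
    """Deduplicate on a composite (text hash, image hash) key.

    Returns the kept samples plus at most `show_num` (original, duplicate)
    pairs for inspection. Hypothetical helper, not a Data-Juicer API.
    """
    first_seen = {}
    kept = []
    dup_pairs = []
    for sample in samples:
        key = (
            md5_hex(sample["text"].encode("utf-8")),
            md5_hex(sample["image_bytes"]),
        )
        if key not in first_seen:
            first_seen[key] = sample
            kept.append(sample)
        elif len(dup_pairs) < show_num:
            dup_pairs.append((first_seen[key], sample))
    return kept, dup_pairs


samples = [
    {"id": 0, "text": "a cat", "image_bytes": b"\x01\x02"},
    {"id": 1, "text": "a cat", "image_bytes": b"\x03\x04"},  # same text, new image
    {"id": 2, "text": "a cat", "image_bytes": b"\x01\x02"},  # duplicate of id 0
]
kept, pairs = dedup_with_pairs(samples, show_num=1)
print([s["id"] for s in kept])                 # [0, 1]
print([(a["id"], b["id"]) for a, b in pairs])  # [(0, 2)]
```

Because the key is a tuple, a sample is dropped only when every modality in the key matches; sample 1 shares its text with sample 0 but survives because its image differs.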

Key Characteristics

Two-phase architecture: compute_hash (per-sample) then process (global dataset operation)
Global operation requiring full dataset access (the process phase cannot be parallelized per-sample, although compute_hash can)
Modality-specific hashing strategies (MD5, MinHash+LSH, SimHash, perceptual hashing)
Optional cross-modal composite keys (e.g., text + image hash tuples)
Configurable similarity thresholds for near-duplicate detection
Cluster-based retention: keeps first occurrence per duplicate cluster
Diagnostic output: optional duplicate pair sampling for traceability
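
Threshold-based clustering with first-occurrence retention can be sketched as below. Small integers stand in for SimHash-style fingerprints, two samples are linked when their Hamming distance is within a threshold, and Union-Find merges linked samples into clusters. All names are hypothetical, and the all-pairs comparison is a simplification chosen for brevity; it is not how a production deduplicator would find candidate pairs.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Attach the later root to the earlier one, so each cluster's
            # root is always its first-occurring sample.
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster_and_keep_first(fingerprints, max_distance):
    """Return indices of the first sample in each near-duplicate cluster."""
    uf = UnionFind(len(fingerprints))
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if hamming(fingerprints[i], fingerprints[j]) <= max_distance:
                uf.union(i, j)
    return [i for i in range(len(fingerprints)) if uf.find(i) == i]


fingerprints = [0b10110000, 0b10110001, 0b01001111, 0b10110011, 0b01001110]
print(cluster_and_keep_first(fingerprints, max_distance=1))  # [0, 2]
```

Samples 0 and 3 differ by two bits, which exceeds the threshold, yet they end up in the same cluster because both are within one bit of sample 1. This transitive merging is why the retention rule is stated per cluster rather than per pair.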

Implementations

Related Pages
