Implementation:Datajuicer Data juicer DocumentMinhashDeduplicator
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, MinHash LSH, Text Processing |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Implements near-duplicate document detection using MinHash with Locality-Sensitive Hashing (LSH), enabling efficient identification and removal of documents that are similar but not identical.
Description
This is the primary near-duplicate detection operator for text data, essential for cleaning large-scale web-crawled datasets where near-duplicate content is pervasive. The code is adapted from the BigCode dataset project.
Helper Functions:
- sha1_hash32() -- 32-bit SHA1 hash function (from datasketch) for token hashing.
- optimal_param() -- Computes optimal LSH bands and rows by minimizing the weighted sum of false positive and false negative probabilities via numerical integration (scipy.integrate).
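The two helpers can be sketched with the standard library alone. This is an illustrative approximation, not the operator's code: `optimal_param_sketch` and `_integrate` are hypothetical names, and a simple midpoint rule stands in for `scipy.integrate`. The sketch assumes the usual LSH collision probability 1 - (1 - s^r)^b for two documents with Jaccard similarity s, b bands, and r rows per band, and assumes equal default weights for the two error types.

```python
import hashlib
import struct


def sha1_hash32(data: bytes) -> int:
    """32-bit hash: first 4 bytes of the SHA1 digest, read little-endian."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def _integrate(f, a, b, steps=200):
    """Midpoint-rule integral (stand-in for scipy.integrate in this sketch)."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h


def optimal_param_sketch(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Search (bands, rows) with bands * rows <= num_perm, minimising the
    weighted sum of false-positive and false-negative areas."""
    best, best_error = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            # Probability that a pair with similarity s shares at least one band.
            collide = lambda s, b=b, r=r: 1.0 - (1.0 - s ** r) ** b
            # Pairs below the threshold that still collide.
            fp = _integrate(collide, 0.0, threshold)
            # Pairs above the threshold that never collide.
            fn = _integrate(lambda s: 1.0 - collide(s), threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best, best_error = (b, r), error
    return best
```

Calling `optimal_param_sketch(0.7, 256)` returns a `(bands, rows)` pair whose product does not exceed 256. Raising `fn_weight` relative to `fp_weight` pushes the search toward more bands with fewer rows, which catches more true duplicates at the cost of extra candidate pairs.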
Main Class: DocumentMinhashDeduplicator
Extends the Deduplicator base class. The deduplication pipeline consists of:
1. Hash Computation (compute_hash) -- Tokenizes text using one of four methods (space, punctuation, character, sentencepiece), generates n-gram shingles of configurable window_size, computes MinHash signatures using random permutation hashing with a Mersenne prime ((2^61)-1), and stores band-wise hash segments.
2. Processing (process) -- Constructs LSH hash tables mapping band hashes to document indices, uses a UnionFind data structure for transitive clustering of similar documents, and retains only one document per cluster (the one with the smallest index).
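The hashing and bucketing steps above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the operator's implementation: `make_permutations`, `minhash_signature`, `band_keys`, and `candidate_pairs` are hypothetical names, the seed is arbitrary, and each "permutation" is taken to be a linear map (a * h + b) mod (2^61 - 1) truncated to 32 bits.

```python
import hashlib
import random
import struct
from collections import defaultdict

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def sha1_hash32(data: bytes) -> int:
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def make_permutations(num_perm, seed=42):
    """One (a, b) coefficient pair per simulated permutation."""
    rng = random.Random(seed)
    return [
        (rng.randint(1, MERSENNE_PRIME - 1), rng.randint(0, MERSENNE_PRIME - 1))
        for _ in range(num_perm)
    ]


def minhash_signature(shingles, perms):
    """Minimum permuted hash over all shingles, for each permutation."""
    hashes = [sha1_hash32(s.encode("utf-8")) for s in shingles]
    return [
        min((((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes),
            default=MAX_HASH)
        for a, b in perms
    ]


def band_keys(signature, num_bands, rows_per_band):
    """Split a signature into bands; the band index is part of each key."""
    return [
        (i, tuple(signature[i * rows_per_band:(i + 1) * rows_per_band]))
        for i in range(num_bands)
    ]


def candidate_pairs(signatures, num_bands, rows_per_band):
    """Documents that share any band key become candidate duplicates."""
    tables = defaultdict(list)
    for doc_idx, sig in enumerate(signatures):
        for key in band_keys(sig, num_bands, rows_per_band):
            tables[key].append(doc_idx)
    pairs = set()
    for bucket in tables.values():
        # Link every later document in the bucket to the first one.
        for other in bucket[1:]:
            pairs.add((bucket[0], other))
    return pairs
```

Linking each bucket member only to the first one keeps the number of pairs linear in bucket size; the transitive closure is left to the union-find step.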
Extended Class: DocumentMinhashDeduplicatorWithUid
Extends DocumentMinhashDeduplicator to support incremental deduplication using persistent unique IDs (`__dj__uid`). When combining a deduplicated dataset A with a new dataset B, documents in A (with lower UIDs) are prioritized for retention.
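The retention rule, for both row indices and UIDs, can be illustrated with a small union-find in which the smaller ID always becomes the cluster root. This is a sketch under that assumption; the `UnionFind` class and the `keep_ids` helper below are hypothetical and written for illustration, not copied from the repository.

```python
class UnionFind:
    """Union-find whose cluster root is always the smallest member ID."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Attach the larger root under the smaller one.
            self.parent[max(rx, ry)] = min(rx, ry)


def keep_ids(all_ids, candidate_pairs):
    """Return the IDs that survive: exactly one (the smallest) per cluster."""
    uf = UnionFind()
    for x, y in candidate_pairs:
        uf.union(x, y)
    return [i for i in all_ids if uf.find(i) == i]


# Dataset A holds UIDs 0-2, the newer dataset B holds UIDs 3-5.
# B's documents 4 and 5 are near-duplicates of A's document 1.
print(keep_ids(range(6), [(4, 1), (5, 4)]))  # [0, 1, 2, 3]
```

Because the root is the minimum ID in its cluster, any document from A that matches documents in B is the one retained, as long as A's UIDs are lower.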
Usage
Configure in YAML under the process list. Supports multiple tokenization methods for different languages (space for English, character for Chinese, sentencepiece for multilingual).
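To show how the tokenization choice changes the shingles that get hashed, here is a minimal sketch of three of the four methods. The `tokenize` and `shingles` helpers and the punctuation regex are assumptions for illustration; the operator's exact splitting rules may differ, and sentencepiece is omitted because it needs a trained model file.

```python
import re
import string

# Assumed punctuation splitter: any ASCII punctuation character.
_PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")


def tokenize(text, method="space"):
    if method == "space":
        return text.split()
    if method == "punctuation":
        return [t.strip() for t in _PUNCT.split(text) if t.strip()]
    if method == "character":
        return list(text)
    raise ValueError(f"unsupported tokenization in this sketch: {method}")


def shingles(text, method="space", window_size=5, lowercase=True):
    """Set of n-gram shingles built from window_size consecutive tokens."""
    if lowercase:
        text = text.lower()
    tokens = tokenize(text, method)
    joiner = "" if method == "character" else " "
    return {
        joiner.join(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    }


print(shingles("The quick brown fox", "space", window_size=2))
print(shingles("你好世界", "character", window_size=2))
```

Character shingles work for Chinese because the text has no whitespace to split on. A text with fewer tokens than `window_size` yields an empty set in this sketch.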
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/deduplicator/document_minhash_deduplicator.py
- Lines: 1-430
Signature
@OPERATORS.register_module("document_minhash_deduplicator")
class DocumentMinhashDeduplicator(Deduplicator):
    def __init__(
        self,
        tokenization: str = "space",
        window_size: PositiveInt = 5,
        lowercase: bool = True,
        ignore_pattern: Optional[str] = None,
        num_permutations: PositiveInt = 256,
        jaccard_threshold: float = 0.7,
        num_bands: Optional[PositiveInt] = None,
        num_rows_per_band: Optional[PositiveInt] = None,
        tokenizer_model: Optional[str] = None,
        *args,
        **kwargs,
    ): ...

    def compute_hash(self, sample) -> dict: ...

    def process(self, dataset, show_num=0) -> Tuple[dataset, dict]: ...


@OPERATORS.register_module("document_minhash_deduplicator_with_uid")
class DocumentMinhashDeduplicatorWithUid(DocumentMinhashDeduplicator): ...
Import
from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
    DocumentMinhashDeduplicator,
    DocumentMinhashDeduplicatorWithUid,
    optimal_param,
    sha1_hash32,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenization | str | No | Tokenization method: "space", "punctuation", "character", or "sentencepiece" (default: "space") |
| window_size | int | No | Shingling window size (default: 5) |
| lowercase | bool | No | Convert text to lowercase before hashing (default: True) |
| ignore_pattern | str | No | Regex pattern for sub-strings to ignore |
| num_permutations | int | No | Number of MinHash permutations (default: 256) |
| jaccard_threshold | float | No | Similarity threshold for deduplication (default: 0.7) |
| num_bands | int | No | LSH bands; auto-computed if None |
| num_rows_per_band | int | No | LSH rows per band; auto-computed if None |
| tokenizer_model | str | No | Path to sentencepiece model (required for sentencepiece tokenization) |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Deduplicated dataset with only unique/first-in-cluster documents retained |
| dup_pairs | dict | Sample of duplicate pairs for tracing (when show_num > 0) |
Usage Examples
# In YAML config:
# process:
#   - document_minhash_deduplicator:
#       tokenization: 'space'
#       window_size: 5
#       num_permutations: 256
#       jaccard_threshold: 0.7
# Programmatic usage:
from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
    DocumentMinhashDeduplicator,
)

dedup = DocumentMinhashDeduplicator(
    tokenization="space",
    window_size=5,
    num_permutations=256,
    jaccard_threshold=0.7,
)

# `dataset` is assumed to be an already-loaded dataset of text samples.
# Compute hashes for each sample
dataset = dataset.map(dedup.compute_hash)

# Run deduplication; show_num=5 also returns up to 5 sample duplicate pairs
deduped_dataset, dup_pairs = dedup.process(dataset, show_num=5)
print(f"Before: {len(dataset)}, After: {len(deduped_dataset)}")