Implementation:Datajuicer Data juicer DocumentMinhashDeduplicator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Deduplication, MinHash LSH, Text Processing
Last Updated	2026-02-14 16:00 GMT

Overview

Implements near-duplicate document detection using MinHash with Locality-Sensitive Hashing (LSH), enabling efficient identification and removal of documents that are similar but not identical.

Description

This is the primary near-duplicate detection operator for text data, essential for cleaning large-scale web-crawled datasets where near-duplicate content is pervasive. The code is adapted from the BigCode dataset project.

Helper Functions:

sha1_hash32() -- 32-bit SHA1 hash function (from datasketch) for token hashing.
optimal_param() -- Computes optimal LSH bands and rows by minimizing the weighted sum of false positive and false negative probabilities via numerical integration (scipy.integrate).
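
The two helpers can be sketched as follows. This is a minimal, dependency-free illustration rather than the operator's own code: the hypothetical `_integrate` helper substitutes a midpoint rule for `scipy.integrate`, and the equal 0.5 weights and the parameter names `fp_weight` / `fn_weight` are assumptions. It relies on the standard LSH result that a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b for b bands of r rows.

```python
import hashlib
import struct


def sha1_hash32(data: bytes) -> int:
    """First 4 bytes of the SHA1 digest, read as a little-endian unsigned 32-bit int."""
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def _integrate(f, lo, hi, steps=200):
    """Midpoint-rule stand-in for scipy.integrate (hypothetical helper)."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Return (bands, rows) with bands * rows <= num_perm that minimizes
    the weighted sum of false-positive and false-negative areas."""
    best, best_error = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            # Probability that a pair with similarity s collides in at least one band.
            collide = lambda s, b=b, r=r: 1.0 - (1.0 - s ** r) ** b
            # Pairs below the threshold that still collide.
            fp = _integrate(collide, 0.0, threshold)
            # Pairs above the threshold that never collide.
            fn = _integrate(lambda s: 1.0 - collide(s), threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best, best_error = (b, r), error
    return best
```

Raising `fp_weight` relative to `fn_weight` pushes the search toward more rows per band, which makes the filter stricter at the cost of missing some true near-duplicates.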

Main Class: DocumentMinhashDeduplicator -- Extends the Deduplicator base class. The deduplication pipeline consists of two steps:

1. Hash Computation (compute_hash) -- Tokenizes text using one of four methods (space, punctuation, character, sentencepiece), generates n-gram shingles of configurable window_size, computes MinHash signatures using random permutation hashing with a Mersenne prime ((2^61)-1), and stores band-wise hash segments.

2. Processing (process) -- Constructs LSH hash tables mapping band hashes to document indices, uses a UnionFind data structure for transitive clustering of similar documents, and retains only one document per cluster (the one with the smallest index).
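
The hash computation in step 1 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the operator's code: it covers only whitespace tokenization, uses the standard library's `random` for the permutation coefficients, and the helper names `space_shingles`, `make_permutations`, `minhash_signature`, `band_segments`, and `estimated_jaccard` are hypothetical.

```python
import hashlib
import random
import struct

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the permutation hashes
MAX_HASH = (1 << 32) - 1        # mask that keeps permuted values within 32 bits


def sha1_hash32(data: bytes) -> int:
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def space_shingles(text, window_size=5, lowercase=True):
    """Set of n-gram shingles over whitespace tokens."""
    tokens = (text.lower() if lowercase else text).split()
    if len(tokens) <= window_size:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + window_size])
            for i in range(len(tokens) - window_size + 1)}


def make_permutations(num_permutations, seed=42):
    """Random (a, b) pairs defining h -> (a * h + b) mod MERSENNE_PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(num_permutations)]


def minhash_signature(shingles, permutations):
    """Minimum permuted hash per permutation; empty input yields MAX_HASH."""
    hashes = [sha1_hash32(s.encode("utf-8")) for s in shingles]
    if not hashes:
        return [MAX_HASH] * len(permutations)
    return [min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
            for a, b in permutations]


def band_segments(signature, num_bands, num_rows_per_band):
    """Split a signature into hashable band keys for the LSH tables."""
    return [tuple(signature[i * num_rows_per_band:(i + 1) * num_rows_per_band])
            for i in range(num_bands)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents land in the same LSH bucket when any one of their band keys is identical, which is why the signature is stored band by band rather than as a single value.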

Extended Class: DocumentMinhashDeduplicatorWithUid -- Extends DocumentMinhashDeduplicator to support incremental deduplication using persistent unique IDs (`__dj__uid`). When a deduplicated dataset A is combined with a new dataset B, documents in A (with lower UIDs) are prioritized for retention.
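
The clustering step and the lowest-ID retention rule can be sketched together. The code below is a simplified illustration, not the operator's implementation: the `UnionFind` class and the `deduplicate` function are hypothetical, and documents are identified by plain integers standing in for dataset indices or `__dj__uid` values.

```python
from collections import defaultdict


class UnionFind:
    """Union-find whose root is always the smallest id in its cluster."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Attach the larger root to the smaller one so the minimum stays on top.
            self.parent[max(rx, ry)] = min(rx, ry)


def deduplicate(band_keys_by_id):
    """Map {doc_id: [band_0_key, band_1_key, ...]} to the sorted ids that are kept."""
    num_bands = len(next(iter(band_keys_by_id.values()), []))
    tables = [defaultdict(list) for _ in range(num_bands)]
    for doc_id, band_keys in band_keys_by_id.items():
        for table, key in zip(tables, band_keys):
            table[key].append(doc_id)

    uf = UnionFind()
    for table in tables:
        for bucket in table.values():
            anchor = min(bucket)
            for doc_id in bucket:
                uf.union(anchor, doc_id)  # merging is transitive across bands

    # A document survives only if it is the root (smallest id) of its cluster.
    return sorted(d for d in band_keys_by_id if uf.find(d) == d)


# Ids 0-1 play the role of dataset A, ids 2-3 the newly added dataset B.
kept = deduplicate({0: ["a", "x"], 1: ["b", "y"], 2: ["a", "z"], 3: ["c", "w"]})
print(kept)  # [0, 1, 3]: document 2 collides with document 0 and is dropped
```

Because the survivor is always the smallest ID in a cluster, assigning lower IDs to the already-deduplicated dataset is enough to make its documents win over later arrivals.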

Usage

Configure in YAML under the process list. Supports multiple tokenization methods for different languages (space for English, character for Chinese, sentencepiece for multilingual).
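
To see why the tokenization choice matters per language, the sketch below compares shingles produced by three of the four modes. It is an approximation for illustration only: the hypothetical `shingle` function splits on ASCII punctuation from `string.punctuation`, which may differ from the pattern the operator uses, and the sentencepiece mode is omitted because it needs a trained model.

```python
import re
import string

# Assumption: ASCII punctuation only; the operator's own pattern may be broader.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def shingle(text, tokenization="space", window_size=5):
    """Return the set of n-gram shingles for the chosen tokenization mode."""
    if tokenization == "character":
        units, joiner = list(text), ""
    elif tokenization == "punctuation":
        units = [t.strip() for t in _PUNCT.split(text) if t.strip()]
        joiner = " "
    elif tokenization == "space":
        units, joiner = text.split(), " "
    else:
        raise ValueError(f"unsupported tokenization: {tokenization}")
    if len(units) < window_size:
        # Texts shorter than the window collapse to at most one shingle.
        return {joiner.join(units)} if units else set()
    return {joiner.join(units[i:i + window_size])
            for i in range(len(units) - window_size + 1)}


chinese = "今天天气很好"
print(shingle(chinese, "space", 2))      # a single shingle: the text has no spaces
print(shingle(chinese, "character", 2))  # five overlapping two-character shingles
```

With whitespace tokenization an unsegmented Chinese sentence becomes one shingle, so any small edit makes two documents look entirely different; character shingles keep the overlap measurable.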

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/deduplicator/document_minhash_deduplicator.py
Lines: 1-430

Signature

@OPERATORS.register_module("document_minhash_deduplicator")
class DocumentMinhashDeduplicator(Deduplicator):
    def __init__(
        self, tokenization: str = "space", window_size: PositiveInt = 5,
        lowercase: bool = True, ignore_pattern: Optional[str] = None,
        num_permutations: PositiveInt = 256,
        jaccard_threshold: float = 0.7,
        num_bands: Optional[PositiveInt] = None,
        num_rows_per_band: Optional[PositiveInt] = None,
        tokenizer_model: Optional[str] = None,
        *args, **kwargs,
    ): ...
    def compute_hash(self, sample) -> dict: ...
    def process(self, dataset, show_num=0) -> Tuple[Dataset, dict]: ...

@OPERATORS.register_module("document_minhash_deduplicator_with_uid")
class DocumentMinhashDeduplicatorWithUid(DocumentMinhashDeduplicator): ...

Import

from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
    DocumentMinhashDeduplicator,
    DocumentMinhashDeduplicatorWithUid,
    optimal_param,
    sha1_hash32,
)

I/O Contract

Inputs

Name	Type	Required	Description
tokenization	str	No	Tokenization method: "space", "punctuation", "character", or "sentencepiece" (default: "space")
window_size	int	No	Shingling window size (default: 5)
lowercase	bool	No	Convert text to lowercase before hashing (default: True)
ignore_pattern	str	No	Regex pattern for sub-strings to ignore
num_permutations	int	No	Number of MinHash permutations (default: 256)
jaccard_threshold	float	No	Similarity threshold for deduplication (default: 0.7)
num_bands	int	No	LSH bands; auto-computed if None
num_rows_per_band	int	No	LSH rows per band; auto-computed if None
tokenizer_model	str	No	Path to sentencepiece model (required for sentencepiece tokenization)

Outputs

Name	Type	Description
dataset	Dataset	Deduplicated dataset with only unique/first-in-cluster documents retained
dup_pairs	dict	Sample of duplicate pairs for tracing (when show_num > 0)

Usage Examples

# In YAML config:
# process:
#   - document_minhash_deduplicator:
#       tokenization: 'space'
#       window_size: 5
#       num_permutations: 256
#       jaccard_threshold: 0.7

# Programmatic usage:
from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
    DocumentMinhashDeduplicator,
)

dedup = DocumentMinhashDeduplicator(
    tokenization="space",
    window_size=5,
    num_permutations=256,
    jaccard_threshold=0.7,
)

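# `dataset` is assumed to be an already-loaded dataset whose samples
# contain the text field the operator reads.
# Compute hashes for each sample
dataset = dataset.map(dedup.compute_hash)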

# Run deduplication
deduped_dataset, dup_pairs = dedup.process(dataset, show_num=5)
print(f"Before: {len(dataset)}, After: {len(deduped_dataset)}")
