Implementation:Datajuicer Data juicer DocumentMinhashDeduplicator

From Leeroopedia
Knowledge Sources
Domains Deduplication, MinHash LSH, Text Processing
Last Updated 2026-02-14 16:00 GMT

Overview

Implements near-duplicate document detection using MinHash with Locality-Sensitive Hashing (LSH), enabling efficient identification and removal of documents that are similar but not identical.

Description

This is the primary near-duplicate detection operator for text data, essential for cleaning large-scale web-crawled datasets where near-duplicate content is pervasive. The code is adapted from the BigCode dataset project.

Helper Functions:

  • sha1_hash32() -- 32-bit SHA1 hash function (from datasketch) for token hashing.
  • optimal_param() -- Computes optimal LSH bands and rows by minimizing the weighted sum of false positive and false negative probabilities via numerical integration (scipy.integrate).

Main Class: DocumentMinhashDeduplicator. Extends the Deduplicator base class. The deduplication pipeline consists of two stages:

1. Hash Computation (compute_hash) -- Tokenizes text using one of four methods (space, punctuation, character, sentencepiece) and generates n-gram shingles of configurable window_size. It then computes a MinHash signature by applying random permutation hashes modulo the Mersenne prime 2^61 - 1, and stores the signature split into band-wise hash segments.

2. Processing (process) -- Constructs LSH hash tables mapping band hashes to document indices, uses a UnionFind data structure for transitive clustering of similar documents, and retains only one document per cluster (the one with the smallest index).
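
The two stages above can be sketched end to end in plain Python. This is a simplified illustration under stated assumptions, not the operator's implementation: all function names, the fixed seed, and the small `num_perm`, `num_bands` and `rows` values are hypothetical, and only space tokenization is shown.

```python
import hashlib
import random
import struct
from collections import defaultdict

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def sha1_hash32(data):
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]


def shingles(text, window_size=5):
    """Lowercased, space-tokenized n-gram shingles encoded as bytes."""
    tokens = text.lower().split()
    n = max(len(tokens) - window_size + 1, 1)
    return {" ".join(tokens[i:i + window_size]).encode("utf-8") for i in range(n)}


def make_permutations(num_perm, seed=42):
    """Random (a, b) pairs defining the hash family h -> (a * h + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
            for _ in range(num_perm)]


def minhash(shingle_set, perms):
    """Signature: for each permutation, the minimum permuted hash over all shingles."""
    hashes = [sha1_hash32(s) for s in shingle_set]
    return [min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
            for a, b in perms]


def band_keys(signature, num_bands, rows):
    """Split the signature into bands; the band index keeps tables separate."""
    return [(i, tuple(signature[i * rows:(i + 1) * rows])) for i in range(num_bands)]


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        # The smaller index becomes the root, so it is the one retained.
        self.parent[max(rx, ry)] = min(rx, ry)


def deduplicate(docs, num_perm=64, num_bands=16, rows=4, window_size=5):
    perms = make_permutations(num_perm)
    tables = defaultdict(list)  # band key -> document indices sharing that band
    for idx, doc in enumerate(docs):
        signature = minhash(shingles(doc, window_size), perms)
        for key in band_keys(signature, num_bands, rows):
            tables[key].append(idx)
    uf = UnionFind()
    for bucket in tables.values():
        for idx in bucket[1:]:
            uf.union(bucket[0], idx)  # transitive clustering across bands
    return [doc for idx, doc in enumerate(docs) if uf.find(idx) == idx]
```

Because documents that collide in any single band are merged, clusters are transitive: if A matches B in one band and B matches C in another, all three collapse to the document with the smallest index.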

Extended Class: DocumentMinhashDeduplicatorWithUid. Extends DocumentMinhashDeduplicator to support incremental deduplication using persistent unique IDs (`__dj__uid`). When an already deduplicated dataset A is combined with a new dataset B, documents in A (which have lower UIDs) are prioritized for retention.
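
The effect of UID priority can be shown with a toy example. The function name `keep_lowest_uid` and the duplicate pairs are hypothetical; the sketch only assumes that each duplicate cluster retains its smallest UID, as described above.

```python
def keep_lowest_uid(uids, duplicate_pairs):
    """Return the UIDs retained when every duplicate cluster keeps its smallest UID."""
    parent = {u: u for u in uids}

    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u

    for x, y in duplicate_pairs:
        rx, ry = find(x), find(y)
        parent[max(rx, ry)] = min(rx, ry)  # lower UID becomes the cluster root
    return sorted(u for u in uids if find(u) == u)


# Dataset A was deduplicated earlier and holds UIDs 0..2; dataset B adds UIDs 3..5.
uids_a = [0, 1, 2]
uids_b = [3, 4, 5]
# Hypothetical near-duplicate pairs: B's document 3 matches A's document 1,
# and B's documents 4 and 5 match each other.
pairs = [(3, 1), (4, 5)]
kept = keep_lowest_uid(uids_a + uids_b, pairs)
print(kept)  # [0, 1, 2, 4]
```

All of A survives, document 3 is dropped in favor of its match in A, and only one of the two new duplicates in B is kept.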

Usage

Configure in YAML under the process list. Supports multiple tokenization methods for different languages (space for English, character for Chinese, sentencepiece for multilingual).
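
To make the tokenization options concrete, here is a minimal sketch of how the shingles differ by method. It is an approximation for illustration: the helper names and the punctuation regex are assumptions, and sentencepiece is omitted because it needs an external tokenizer model.

```python
import re
import string

# Assumed punctuation set: ASCII punctuation only.
PUNCTUATION_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def tokenize(text, method="space"):
    if method == "space":
        return text.split()
    if method == "punctuation":
        return [t.strip() for t in PUNCTUATION_RE.split(text) if t.strip()]
    if method == "character":
        return list(text)
    raise ValueError(f"unsupported tokenization in this sketch: {method}")


def shingles(text, method="space", window_size=5, lowercase=True):
    """Set of n-gram shingles built from window_size consecutive tokens."""
    if lowercase:
        text = text.lower()
    tokens = tokenize(text, method)
    joiner = "" if method == "character" else " "
    n = max(len(tokens) - window_size + 1, 1)
    return {joiner.join(tokens[i:i + window_size]) for i in range(n)}


if __name__ == "__main__":
    print(shingles("Data cleaning removes near duplicates", window_size=3))
    print(shingles("你好世界", method="character", window_size=2))
```

Character shingles work for languages without whitespace word boundaries, which is why that method suits Chinese text, while space tokenization gives word-level shingles for English.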

Code Reference

Source Location

  • Repository: Datajuicer_Data_juicer
  • File: data_juicer/ops/deduplicator/document_minhash_deduplicator.py
  • Lines: 1-430

Signature

@OPERATORS.register_module("document_minhash_deduplicator")
class DocumentMinhashDeduplicator(Deduplicator):
    def __init__(
        self, tokenization: str = "space", window_size: PositiveInt = 5,
        lowercase: bool = True, ignore_pattern: Optional[str] = None,
        num_permutations: PositiveInt = 256,
        jaccard_threshold: float = 0.7,
        num_bands: Optional[PositiveInt] = None,
        num_rows_per_band: Optional[PositiveInt] = None,
        tokenizer_model: Optional[str] = None,
        *args, **kwargs,
    ): ...
    def compute_hash(self, sample) -> dict: ...
    def process(self, dataset, show_num=0) -> Tuple[dataset, dict]: ...

@OPERATORS.register_module("document_minhash_deduplicator_with_uid")
class DocumentMinhashDeduplicatorWithUid(DocumentMinhashDeduplicator): ...

Import

from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
    DocumentMinhashDeduplicator,
    DocumentMinhashDeduplicatorWithUid,
    optimal_param,
    sha1_hash32,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| tokenization | str | No | Tokenization method: "space", "punctuation", "character", or "sentencepiece" (default: "space") |
| window_size | int | No | Shingling window size (default: 5) |
| lowercase | bool | No | Convert text to lowercase before hashing (default: True) |
| ignore_pattern | str | No | Regex pattern for sub-strings to ignore (default: None) |
| num_permutations | int | No | Number of MinHash permutations (default: 256) |
| jaccard_threshold | float | No | Jaccard similarity threshold for deduplication (default: 0.7) |
| num_bands | int | No | Number of LSH bands; auto-computed if None |
| num_rows_per_band | int | No | Number of LSH rows per band; auto-computed if None |
| tokenizer_model | str | No | Path to sentencepiece model (required for sentencepiece tokenization) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| dataset | Dataset | Deduplicated dataset with only unique/first-in-cluster documents retained |
| dup_pairs | dict | Sample of duplicate pairs for tracing (when show_num > 0) |

Usage Examples

# In YAML config:
# process:
#   - document_minhash_deduplicator:
#       tokenization: 'space'
#       window_size: 5
#       num_permutations: 256
#       jaccard_threshold: 0.7

# Programmatic usage:
from data_juicer.ops.deduplicator.document_minhash_deduplicator import (
    DocumentMinhashDeduplicator,
)

dedup = DocumentMinhashDeduplicator(
    tokenization="space",
    window_size=5,
    num_permutations=256,
    jaccard_threshold=0.7,
)

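# Compute hashes for each sample.
# `dataset` is assumed to be an already loaded dataset whose samples carry the text to deduplicate.
dataset = dataset.map(dedup.compute_hash)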

# Run deduplication
deduped_dataset, dup_pairs = dedup.process(dataset, show_num=5)
print(f"Before: {len(dataset)}, After: {len(deduped_dataset)}")
