Implementation:Datajuicer Data juicer DocumentSimhashDeduplicator
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Deduplication |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator for near-duplicate document detection using SimHash.
Description
DocumentSimhashDeduplicator extends Deduplicator and computes a SimHash fingerprint for each document using the simhash-pybind library over text shingles (n-grams whose length is set by window_size). It supports space, punctuation, and character-level tokenization, with optional lowercase conversion and an optional regex (ignore_pattern) for sub-strings to skip. The deduplication phase calls simhash.find_all to identify every pair of documents whose fingerprints lie within the Hamming distance threshold, builds a graph with those pairs as edges, and runs a breadth-first search (BFS) to group connected documents into clusters. Only the first document in each cluster is retained. The hamming_distance threshold must be smaller than num_blocks. Because SimHash is deterministic, the same text always yields the same fingerprint, and texts that differ by only a few shingles tend to yield fingerprints that differ in only a few bits, which makes the operator well suited to detecting documents with small textual variations.
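The shingling and fingerprinting steps described above can be sketched in pure Python. This is an illustration only, not the operator's code: the operator delegates hashing to simhash-pybind, whereas this sketch swaps in a textbook 64-bit SimHash built on MD5. The helper names `shingles`, `simhash64`, and `hamming` are hypothetical, punctuation tokenization and ignore_pattern are omitted, and the handling of texts shorter than the window is an assumption of the sketch.

```python
import hashlib


def shingles(text, window_size=6, tokenization="space", lowercase=True):
    """Return the set of n-gram shingles of `text` (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if tokenization == "character":
        tokens, joiner = list(text), ""
    elif tokenization == "space":
        tokens, joiner = text.split(), " "
    else:
        raise ValueError("this sketch only covers 'space' and 'character'")
    # Assumption of this sketch: a text shorter than the window
    # contributes one shingle holding all of its tokens.
    count = max(1, len(tokens) - window_size + 1)
    return {joiner.join(tokens[i:i + window_size]) for i in range(count)}


def simhash64(features):
    """Textbook SimHash: per-bit majority vote over 64-bit feature hashes."""
    counts = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "The quick brown fox jumps over the lazy dog near the river bank"
doc_b = "The quick brown fox jumps over the lazy dog near the river shore"
fp_a = simhash64(shingles(doc_a, window_size=3))
fp_b = simhash64(shingles(doc_b, window_size=3))
print(hamming(fp_a, fp_b))  # compared against hamming_distance in the operator
```

Two documents would be treated as near-duplicates when the printed distance does not exceed the configured threshold.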
Usage
Use when you need to detect and remove near-duplicate documents from text datasets, where documents may have small variations such as formatting changes, minor edits, or boilerplate differences.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/deduplicator/document_simhash_deduplicator.py
Signature
@OPERATORS.register_module("document_simhash_deduplicator")
class DocumentSimhashDeduplicator(Deduplicator):
    def __init__(self, tokenization: str = "space",
                 window_size: PositiveInt = 6,
                 lowercase: bool = True,
                 ignore_pattern: Optional[str] = None,
                 num_blocks: PositiveInt = 6,
                 hamming_distance: PositiveInt = 4,
                 *args, **kwargs):
Import
from data_juicer.ops.deduplicator.document_simhash_deduplicator import DocumentSimhashDeduplicator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenization | str | No | Tokenization method: "space", "punctuation", or "character". Default: "space" |
| window_size | PositiveInt | No | Window size of shingling (n-gram size). Default: 6 |
| lowercase | bool | No | Whether to convert text to lower case first. Default: True |
| ignore_pattern | Optional[str] | No | Regex pattern for sub-strings to ignore during SimHash computation. Default: None |
| num_blocks | PositiveInt | No | Number of blocks in SimHash computing. Default: 6 |
| hamming_distance | PositiveInt | No | Max Hamming distance threshold for near-duplicate detection. Must be less than num_blocks. Default: 4 |
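One plausible reason for requiring hamming_distance to be smaller than num_blocks is the pigeonhole principle: if two fingerprints differ in at most k bits and each is cut into more than k blocks, at least one block must be identical in both, so exact block matches can serve as a candidate filter. The sketch below demonstrates that property on 64-bit integers. The contiguous block layout and the helper names `split_blocks` and `matching_blocks` are assumptions made for illustration; how simhash-pybind partitions bits internally is not stated in this page.

```python
def split_blocks(fingerprint, num_blocks, bits=64):
    """Cut a fingerprint into contiguous blocks whose widths differ by at most 1."""
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((fingerprint >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def matching_blocks(a, b, num_blocks, bits=64):
    """Count the blocks that are bit-for-bit identical in both fingerprints."""
    pairs = zip(split_blocks(a, num_blocks, bits), split_blocks(b, num_blocks, bits))
    return sum(x == y for x, y in pairs)


a = 0x0123456789ABCDEF

# With 6 blocks of a 64-bit value, this layout starts blocks at bits
# 0, 11, 22, 33, 44 and 54. Flipping one bit in each of the first four
# blocks gives a Hamming distance of 4, the worst case for block sharing.
b = a ^ ((1 << 0) | (1 << 11) | (1 << 22) | (1 << 33))
print(matching_blocks(a, b, num_blocks=6))  # 2 blocks still match exactly

# With a distance of 6 (not smaller than num_blocks), every block can be hit.
c = b ^ ((1 << 44) | (1 << 54))
print(matching_blocks(a, c, num_blocks=6))  # 0 blocks match
```

With the defaults (num_blocks 6, hamming_distance 4), any pair within the threshold shares at least two identical blocks under this layout.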
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Deduplicated dataset with only the first document from each similarity cluster retained |
| dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0) |
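The "first document from each similarity cluster" rule can be sketched as a BFS over the graph of similar pairs. This is a minimal illustration under stated assumptions: documents are identified by their integer position in the dataset, the pair list stands in for whatever simhash.find_all yields after being mapped back to document indices, and `keep_first_per_cluster` is a hypothetical name rather than a Data-Juicer function.

```python
from collections import defaultdict, deque


def keep_first_per_cluster(num_docs, similar_pairs):
    """Return indices of documents to keep: the first one in each cluster."""
    # Build an undirected graph whose edges are the near-duplicate pairs.
    graph = defaultdict(set)
    for i, j in similar_pairs:
        graph[i].add(j)
        graph[j].add(i)

    visited = set()
    kept = []
    for doc in range(num_docs):
        if doc in visited:
            continue  # already absorbed into an earlier cluster
        kept.append(doc)  # lowest index in its cluster, so it is retained
        visited.add(doc)
        queue = deque([doc])
        while queue:
            current = queue.popleft()
            for neighbor in graph.get(current, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
    return kept


# Documents 0, 2 and 4 form one cluster (0~2 and 2~4), 1 and 5 another,
# and document 3 has no near-duplicate.
print(keep_first_per_cluster(6, [(0, 2), (2, 4), (1, 5)]))  # [0, 1, 3]
```

Clustering is transitive in this scheme: documents 0 and 4 end up in the same cluster even though they were never reported as a direct pair.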
Usage Examples
process:
  - document_simhash_deduplicator:
      tokenization: "character"
      window_size: 4
      hamming_distance: 3
      num_blocks: 6