Implementation:Datajuicer Data juicer DocumentSimhashDeduplicator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Deduplication
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for detecting near-duplicate documents using SimHash.

Description

DocumentSimhashDeduplicator extends Deduplicator and computes a SimHash fingerprint for each document with the simhash-pybind library, using text shingles (n-grams whose size is set by window_size) as features. It supports space, punctuation, and character-level tokenization, with optional lowercase conversion and an optional regex pattern for sub-strings to ignore.

In the deduplication phase, simhash.find_all identifies all document pairs whose fingerprints lie within the hamming_distance threshold. These pairs form a graph of similar documents, which is grouped into clusters by breadth-first search (BFS); only the first document in each cluster is retained. The hamming_distance threshold must be less than the configured num_blocks parameter. Fingerprinting is deterministic, and the approach is well suited to detecting documents with small textual variations.
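To make the fingerprinting step concrete, the following is a minimal pure-Python sketch of shingling followed by a 64-bit SimHash. It is an illustration under stated assumptions, not the operator's code: the real implementation delegates hashing to simhash-pybind, and the helper names (shingles, simhash64, hamming), the MD5-based feature hash, and the simple tokenization rules used here are all hypothetical choices made for this example.

```python
import hashlib
import re


def shingles(text, window_size=6, tokenization="space",
             lowercase=True, ignore_pattern=None):
    """Return the set of n-gram shingles for one document (illustrative rules)."""
    if lowercase:
        text = text.lower()
    if ignore_pattern:
        # Drop every sub-string matching the pattern before tokenizing.
        text = re.sub(ignore_pattern, "", text)
    if tokenization == "character":
        tokens, joiner = list(text), ""
    elif tokenization == "punctuation":
        # Assumption: split on runs of punctuation and keep non-empty pieces.
        tokens = [t.strip() for t in re.split(r"[^\w\s]+", text) if t.strip()]
        joiner = " "
    else:  # "space"
        tokens, joiner = text.split(), " "
    if not tokens:
        return set()
    # A document shorter than the window yields a single shingle.
    last_start = max(1, len(tokens) - window_size + 1)
    return {joiner.join(tokens[i:i + window_size]) for i in range(last_start)}


def simhash64(features):
    """Classic SimHash: each feature votes +1 or -1 on each of 64 bit positions."""
    counts = [0] * 64
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when hamming(simhash64(shingles(a)), simhash64(shingles(b))) does not exceed the configured threshold.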

Usage

Use when you need to detect and remove near-duplicate documents from text datasets, where documents may have small variations such as formatting changes, minor edits, or boilerplate differences.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/deduplicator/document_simhash_deduplicator.py

Signature

@OPERATORS.register_module("document_simhash_deduplicator")
class DocumentSimhashDeduplicator(Deduplicator):
    def __init__(self, tokenization: str = "space",
                 window_size: PositiveInt = 6,
                 lowercase: bool = True,
                 ignore_pattern: Optional[str] = None,
                 num_blocks: PositiveInt = 6,
                 hamming_distance: PositiveInt = 4,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.document_simhash_deduplicator import DocumentSimhashDeduplicator

I/O Contract

Inputs

Name	Type	Required	Description
tokenization	str	No	Tokenization method: "space", "punctuation", or "character". Default: "space"
window_size	PositiveInt	No	Window size of shingling (n-gram size). Default: 6
lowercase	bool	No	Whether to convert text to lower case first. Default: True
ignore_pattern	Optional[str]	No	Regex pattern for sub-strings to ignore during SimHash computation. Default: None
num_blocks	PositiveInt	No	Number of blocks in SimHash computing. Default: 6
hamming_distance	PositiveInt	No	Max Hamming distance threshold for near-duplicate detection. Must be less than num_blocks. Default: 4
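The requirement that hamming_distance be less than num_blocks can be motivated by a pigeonhole argument commonly used for block-based SimHash lookups: if a fingerprint is split into num_blocks blocks and two fingerprints differ in at most hamming_distance bits, then at least num_blocks - hamming_distance blocks are identical, so candidate pairs can be found by exact block matches. The sketch below illustrates that argument for 64-bit fingerprints. The helper names (split_blocks, matching_blocks) and the block layout are assumptions for illustration and are not taken from simhash-pybind.

```python
def split_blocks(fp, num_blocks=6, bits=64):
    """Split a fingerprint into contiguous bit blocks whose widths differ by at most 1."""
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)
        blocks.append((fp >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def matching_blocks(a, b, num_blocks=6):
    """Count the blocks that are bit-for-bit identical in both fingerprints."""
    return sum(x == y for x, y in zip(split_blocks(a, num_blocks),
                                      split_blocks(b, num_blocks)))
```

With the defaults (num_blocks 6, hamming_distance 4), any pair within the threshold shares at least two identical blocks. If hamming_distance were allowed to reach num_blocks, every block could differ and no exact block match would be guaranteed.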

Outputs

Name	Type	Description
dataset	Dataset	Deduplicated dataset with only the first document from each similarity cluster retained
dup_pairs	dict	Dictionary of sampled duplicate pairs (when show_num > 0)
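The retention rule in the outputs above (keep only the first document of each similarity cluster) can be sketched as follows. This is a simplified illustration under stated assumptions: a brute-force pairwise comparison stands in for simhash.find_all, "first" is taken to mean the lowest dataset index, and the function name cluster_keep_first is hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def cluster_keep_first(fingerprints, hamming_distance=4):
    """Return the indices of documents kept after near-duplicate clustering."""
    # Build an undirected graph linking every pair within the threshold.
    # Brute force is O(n^2); it replaces the library's pair search here.
    graph = defaultdict(set)
    for i, j in combinations(range(len(fingerprints)), 2):
        if bin(fingerprints[i] ^ fingerprints[j]).count("1") <= hamming_distance:
            graph[i].add(j)
            graph[j].add(i)

    visited, kept = set(), []
    for start in range(len(fingerprints)):
        if start in visited:
            continue
        # The first unvisited document seeds a new cluster and is retained.
        kept.append(start)
        visited.add(start)
        queue = deque([start])
        while queue:  # BFS marks the rest of the cluster as duplicates
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return kept
```

Because clusters are connected components, two documents can land in the same cluster through a chain of near-duplicates even when their own distance exceeds the threshold.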

Usage Examples

process:
  - document_simhash_deduplicator:
      tokenization: "character"
      window_size: 4
      hamming_distance: 3
      num_blocks: 6

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
