
Implementation:Datajuicer Data juicer DocumentSimhashDeduplicator

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Deduplication
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for detecting near-duplicate documents with SimHash.

Description

DocumentSimhashDeduplicator extends Deduplicator. It computes a SimHash fingerprint for each document with the simhash-pybind library, using text shingles (n-grams whose size is set by window_size) as features. Tokenization can be space-based, punctuation-based, or character-level, with optional lowercasing and an optional regex (ignore_pattern) for sub-strings to drop before hashing.

In the deduplication phase, simhash.find_all identifies all pairs of fingerprints whose Hamming distance is within the hamming_distance threshold. These pairs form the edges of a similarity graph, which is clustered by breadth-first search (BFS); only the first document in each cluster is retained. The hamming_distance threshold must be less than num_blocks. Because SimHash fingerprints are deterministic, the same text and settings always yield the same fingerprint, which makes the operator well suited to detecting documents that differ by small textual variations.
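To make the fingerprinting step concrete, here is a minimal pure-Python sketch of shingling followed by a 64-bit SimHash. It is an illustration only, not the operator's code: the helper names (shingles, simhash64, hamming), the use of MD5 as the per-shingle hash, and the punctuation-splitting regex are all assumptions, and simhash-pybind may compute its fingerprints differently.

```python
import hashlib
import re


def shingles(text, tokenization="space", window_size=6,
             lowercase=True, ignore_pattern=None):
    """Return the set of overlapping n-gram shingles (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if ignore_pattern:
        text = re.sub(ignore_pattern, "", text)
    if tokenization == "character":
        tokens, joiner = list(text), ""
    elif tokenization == "punctuation":
        # Assumed behaviour: split on runs of punctuation characters.
        tokens = [t.strip() for t in re.split(r"[^\w\s]+", text) if t.strip()]
        joiner = " "
    elif tokenization == "space":
        tokens, joiner = text.split(), " "
    else:
        raise ValueError(f"unknown tokenization: {tokenization}")
    # Texts shorter than the window produce a single shingle.
    last_start = max(1, len(tokens) - window_size + 1)
    return {joiner.join(tokens[i:i + window_size]) for i in range(last_start)}


def simhash64(features):
    """Classic SimHash: each feature votes +1/-1 on every one of 64 bits."""
    counts = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # assumed 64-bit feature hash
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when hamming(simhash64(shingles(a)), simhash64(shingles(b))) is at most the configured threshold. The sketch compares fingerprints pairwise, whereas the operator delegates the search to simhash.find_all.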

Usage

Use when you need to detect and remove near-duplicate documents from text datasets, where documents may have small variations such as formatting changes, minor edits, or boilerplate differences.

Code Reference

Source Location

Signature

@OPERATORS.register_module("document_simhash_deduplicator")
class DocumentSimhashDeduplicator(Deduplicator):
    def __init__(self, tokenization: str = "space",
                 window_size: PositiveInt = 6,
                 lowercase: bool = True,
                 ignore_pattern: Optional[str] = None,
                 num_blocks: PositiveInt = 6,
                 hamming_distance: PositiveInt = 4,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.document_simhash_deduplicator import DocumentSimhashDeduplicator

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
tokenization | str | No | Tokenization method: "space", "punctuation", or "character". Default: "space"
window_size | PositiveInt | No | Shingle window size (n-gram size). Default: 6
lowercase | bool | No | Whether to convert the text to lowercase first. Default: True
ignore_pattern | Optional[str] | No | Regex pattern for sub-strings to ignore during SimHash computation. Default: None
num_blocks | PositiveInt | No | Number of blocks used in SimHash computation. Default: 6
hamming_distance | PositiveInt | No | Maximum Hamming distance for two documents to count as near-duplicates. Must be less than num_blocks. Default: 4
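The requirement that hamming_distance be less than num_blocks is typical of block-based SimHash search, and the usual reasoning is the pigeonhole principle: if two fingerprints differ in at most d bits and each is cut into more than d blocks, at least one block must match exactly, so candidates can be found by exact block lookups. The sketch below illustrates that reasoning under the assumption of 64-bit fingerprints and contiguous blocks; split_blocks, matching_blocks, and check_params are hypothetical helpers, not part of Data-Juicer or simhash-pybind.

```python
def split_blocks(fingerprint, num_blocks, bits=64):
    """Cut a `bits`-wide fingerprint into `num_blocks` contiguous bit blocks."""
    base, extra = divmod(bits, num_blocks)
    blocks, shift = [], 0
    for i in range(num_blocks):
        width = base + (1 if i < extra else 0)  # spread the remainder bits
        blocks.append((fingerprint >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def matching_blocks(a, b, num_blocks):
    """Count blocks that are bit-for-bit identical in both fingerprints."""
    pairs = zip(split_blocks(a, num_blocks), split_blocks(b, num_blocks))
    return sum(x == y for x, y in pairs)


def check_params(num_blocks, hamming_distance):
    """Reject settings that violate the documented constraint."""
    if hamming_distance >= num_blocks:
        raise ValueError("hamming_distance must be less than num_blocks")


# Flip 4 bits that fall into 4 different blocks: with 6 blocks,
# at least 6 - 4 = 2 blocks are guaranteed to stay identical.
a = 0x0123456789ABCDEF
b = a ^ ((1 << 0) | (1 << 20) | (1 << 40) | (1 << 63))
check_params(num_blocks=6, hamming_distance=4)
assert matching_blocks(a, b, num_blocks=6) >= 6 - 4
```

With hamming_distance equal to or greater than num_blocks, every block could differ and the guarantee disappears, which is why check_params rejects such settings.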

Outputs

Name | Type | Description
--- | --- | ---
dataset | Dataset | Deduplicated dataset with only the first document from each similarity cluster retained
dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0)
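The way the retained dataset follows from the similar pairs can be sketched as below. This is a hedged illustration, not the operator's code: cluster_and_keep_first is a hypothetical helper, documents are represented by their integer positions, and "first" is assumed to mean the lowest dataset index in each cluster. Sampling of dup_pairs is not modelled.

```python
from collections import defaultdict, deque


def cluster_and_keep_first(num_docs, similar_pairs):
    """Group documents linked by similar pairs; keep one index per cluster."""
    # Undirected similarity graph: an edge for every near-duplicate pair.
    graph = defaultdict(set)
    for i, j in similar_pairs:
        graph[i].add(j)
        graph[j].add(i)

    visited, keep, clusters = set(), [], []
    for start in range(num_docs):
        if start in visited:
            continue
        # BFS from the first unvisited document collects its whole cluster.
        visited.add(start)
        queue, cluster = deque([start]), [start]
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
                    cluster.append(neighbour)
        clusters.append(sorted(cluster))
        keep.append(start)  # lowest index in the cluster, visited first
    return keep, clusters


keep, clusters = cluster_and_keep_first(5, [(0, 2), (2, 4)])
```

Note that clustering is transitive: documents 0 and 4 land in the same cluster through document 2 even though they were not reported as a pair themselves.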

Usage Examples

process:
  - document_simhash_deduplicator:
      tokenization: "character"
      window_size: 4
      hamming_distance: 3
      num_blocks: 6
