Workflow: ChenghaoMou text-dedup SimHash Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for near-duplicate text detection using SimHash fingerprinting with bit-difference bucketing and Union-Find clustering.
Description
This workflow detects near-duplicate documents by computing SimHash fingerprints (fixed-length bit vectors that preserve cosine similarity) and grouping documents whose fingerprints differ by at most a configurable number of bits. Unlike MinHash, which approximates Jaccard (set-overlap) similarity, SimHash captures weighted token-frequency similarity. Documents are placed into buckets using bit permutations; within each bucket, pairs with Hamming distance at or below the threshold are merged using a Union-Find data structure. An optional Jaccard verification step removes false positives. Both 64-bit and 128-bit hash modes are supported.
Goal: A deduplicated dataset with near-duplicate documents removed based on SimHash fingerprint similarity.
Scope: From raw text data through SimHash fingerprinting, bit-difference bucketing, Union-Find clustering, and filtered output.
Strategy: Uses SimHash for compact document representation and bit permutations for efficient candidate generation, with Union-Find for transitive closure of duplicate relationships.
Usage
Execute this workflow when you need near-duplicate detection with a focus on weighted token similarity rather than set overlap. SimHash is particularly effective for longer documents where token frequency matters. It works well when the bit-difference threshold can be tuned to the expected level of document similarity. Use this when MinHash's set-based Jaccard similarity is not the ideal similarity metric for your data.
Execution Steps
Step 1: Configuration Loading
Parse the TOML configuration file into a typed Config object. The SimHash-specific configuration includes hash bit width (64 or 128), n-gram size, bit-difference threshold, and bucket permutation parameters. The config automatically generates bit permutation masks for the bucketing scheme based on the hash width and bit-difference settings.
Key considerations:
- Hash bits can be 64 or 128, affecting fingerprint granularity
- The bit_diff parameter controls how many bit differences are tolerated
- Bucket permutations are auto-generated from hash bits and bit_diff settings
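The bucketing idea behind the auto-generated masks can be illustrated with the pigeonhole principle: if a fingerprint is split into `bit_diff + 1` blocks, two fingerprints that differ in at most `bit_diff` bits must agree on at least one whole block. The sketch below uses hypothetical helper names (`block_masks`, `bucket_keys`); the actual permutation scheme in text-dedup is more elaborate, but the guarantee it exploits is the same.

```python
def block_masks(hash_bits: int, bit_diff: int) -> list[tuple[int, int]]:
    """Split a hash of `hash_bits` bits into `bit_diff + 1` contiguous blocks.

    By the pigeonhole principle, two fingerprints differing in at most
    `bit_diff` bits must be identical in at least one block.
    """
    n_blocks = bit_diff + 1
    base, extra = divmod(hash_bits, n_blocks)
    masks, offset = [], 0
    for i in range(n_blocks):
        width = base + (1 if i < extra else 0)
        masks.append((offset, (1 << width) - 1))  # (shift, mask) per block
        offset += width
    return masks

def bucket_keys(fingerprint: int, masks: list[tuple[int, int]]) -> list[tuple[int, int]]:
    # Pair each block index with its extracted bits so that equal bit
    # patterns at different positions never collide across blocks.
    return [(i, (fingerprint >> shift) & mask)
            for i, (shift, mask) in enumerate(masks)]
```

Any two documents whose fingerprints are within `bit_diff` bits of each other are guaranteed to share at least one bucket key, so candidate generation never misses a true near-duplicate pair (it only produces extra candidates to check).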
Step 2: Data Loading
Load the dataset using the unified data I/O layer from local files or HuggingFace datasets. Each document receives an internal index column. Unlike MinHash, SimHash does not pre-filter short documents; the fingerprint computation handles all document lengths.
Key considerations:
- Same data loading abstraction as all other algorithms
- Internal index column added for cluster tracking
- No length-based pre-filtering is applied
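The index column's role can be shown with a minimal stdlib sketch (the column name `__index__` is a hypothetical choice, not the library's actual name): each record gets a stable integer identity so cluster assignments survive later shuffling and filtering.

```python
def add_index_column(records: list[dict], column: str = "__index__") -> list[dict]:
    """Attach a stable internal index to each record for cluster tracking."""
    return [{**rec, column: i} for i, rec in enumerate(records)]
```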
Step 3: SimHash Fingerprinting
Compute SimHash fingerprints for each document using parallel map operations. Each document is tokenized into n-grams, and a weighted bit vector is accumulated by hashing each n-gram and adding/subtracting from dimension accumulators. The final sign of each dimension produces the binary fingerprint. The fingerprint is then split into bucket keys using bit permutations.
Key considerations:
- N-gram tokenization determines the feature space
- Hash function maps each n-gram to a bit vector contribution
- Output includes bucket keys and fingerprint values for the clustering step
Step 4: Bit-Difference Bucketing and Union-Find Clustering
Iterate through all documents grouped by their bucket keys. Within each bucket, compare the Hamming distance between fingerprint pairs. When two documents have a Hamming distance at or below the bit_diff threshold, they are merged in a Union-Find data structure. After processing all buckets, extract connected components to produce cluster assignments. Only non-trivial clusters (where a document maps to a different representative) are retained.
Key considerations:
- Bucketing reduces the comparison space from O(n^2) to manageable bucket sizes
- Union-Find provides efficient transitive closure of duplicate relationships
- The sequential iteration is necessary because Union-Find is stateful
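A minimal sketch of the bucketed clustering pass, with a hypothetical `cluster` helper and a basic Union-Find (the library's implementation differs in detail):

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)  # lowest index is representative

def cluster(fingerprints: dict[int, int], keys: dict[int, list], bit_diff: int) -> dict[int, int]:
    """fingerprints: doc index -> SimHash; keys: doc index -> bucket keys."""
    buckets = defaultdict(list)
    for idx, ks in keys.items():
        for k in ks:
            buckets[k].append(idx)
    uf = UnionFind()
    for members in buckets.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:  # pairwise check within the bucket only
                if bin(fingerprints[a] ^ fingerprints[b]).count("1") <= bit_diff:
                    uf.union(a, b)
    # Keep only non-trivial assignments: doc maps to a different representative.
    return {i: uf.find(i) for i in fingerprints if uf.find(i) != i}
```

Keeping only non-trivial entries means the returned mapping doubles as the duplicate set: every key is a document to drop, every value its retained representative.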
Step 5: False Positive Verification
Optionally verify candidate clusters by computing actual Jaccard similarity on n-gram token sets. For each cluster, all pairs within the cluster are compared. Verified pairs are re-merged using a fresh Union-Find to produce refined cluster assignments. Statistics on false positives and true positives are logged.
Key considerations:
- Optional step controlled by check_false_positive flag
- Uses Jaccard similarity with a configurable jaccard_threshold
- Rebuilds clusters from scratch using only verified pairs
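The verification step can be sketched as below, assuming word n-gram sets for the Jaccard computation (hypothetical helper names; the library's tokenization is configurable). The returned verified pairs are what a fresh Union-Find would be rebuilt from.

```python
def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity over word n-gram sets."""
    def grams(t: str) -> set:
        toks = t.split()
        return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def verify_cluster(docs: dict[int, str], members: list[int], threshold: float) -> list[tuple[int, int]]:
    """Return pairs inside a candidate cluster whose true Jaccard similarity
    meets the threshold; pairs failing the check are counted as false
    positives and dropped from the rebuilt clusters."""
    return [
        (a, b)
        for i, a in enumerate(members)
        for b in members[i + 1:]
        if jaccard(docs[a], docs[b]) >= threshold
    ]
```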
Step 6: Duplicate Removal and Output
Filter the dataset to remove documents marked as duplicates (keeping only non-duplicate representatives). Save the deduplicated dataset and optionally export cluster assignments. Clean up cache files if configured.
Key considerations:
- Filtering removes documents flagged as duplicates (not cluster representatives)
- Output format is HuggingFace Dataset saved to disk
- Cluster metadata can be preserved for downstream analysis
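The final filter follows directly from the cluster-assignment convention: since only non-trivial assignments are retained, a record is a duplicate exactly when its index appears as a key. A minimal sketch (column name `__index__` is a hypothetical choice):

```python
def remove_duplicates(records: list[dict], assignments: dict[int, int],
                      index_col: str = "__index__") -> list[dict]:
    """Drop records mapped to a different representative; representatives
    themselves never appear as keys in `assignments`, so they are kept."""
    return [r for r in records if r[index_col] not in assignments]
```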