Workflow: ChenghaoMou text-dedup SimHash Deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for near-duplicate text detection using SimHash fingerprinting with bit-difference bucketing and Union-Find clustering.
Description
This workflow detects near-duplicate documents by computing SimHash fingerprints (fixed-length bit vectors that preserve cosine similarity) and grouping documents whose fingerprints differ by at most a configurable number of bits. Unlike MinHash, which approximates Jaccard (set-overlap) similarity, SimHash captures weighted token-frequency similarity. Documents are placed into buckets using bit permutations; within each bucket, pairs with Hamming distance at or below the threshold are merged using a Union-Find data structure. An optional Jaccard verification step removes false positives. Both 64-bit and 128-bit hash modes are supported.
Goal: A deduplicated dataset with near-duplicate documents removed based on SimHash fingerprint similarity.
Scope: From raw text data through SimHash fingerprinting, bit-difference bucketing, Union-Find clustering, and filtered output.
Strategy: Uses SimHash for compact document representation and bit permutations for efficient candidate generation, with Union-Find for transitive closure of duplicate relationships.
Usage
Execute this workflow when you need near-duplicate detection with a focus on weighted token similarity rather than set overlap. SimHash is particularly effective for longer documents where token frequency matters. It works well when the bit-difference threshold can be tuned to the expected level of document similarity. Use this when MinHash's set-based Jaccard similarity is not the ideal similarity metric for your data.
Execution Steps
Step 1: Configuration Loading
Parse the TOML configuration file into a typed Config object. The SimHash-specific configuration includes hash bit width (64 or 128), n-gram size, bit-difference threshold, and bucket permutation parameters. The config automatically generates bit permutation masks for the bucketing scheme based on the hash width and bit-difference settings.
Key considerations:
- Hash bits can be 64 or 128, affecting fingerprint granularity
- The bit_diff parameter controls how many bit differences are tolerated
- Bucket permutations are auto-generated from hash bits and bit_diff settings
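The bucketing idea behind the auto-generated masks can be illustrated with the pigeonhole principle: if a fingerprint is split into `bit_diff + 1` blocks, two fingerprints that differ in at most `bit_diff` bits must agree on at least one whole block. The sketch below uses hypothetical helper names (`block_masks`, `bucket_keys`); the actual permutation scheme in text-dedup is more elaborate, but the guarantee it exploits is the same.

```python
def block_masks(hash_bits: int, bit_diff: int) -> list[tuple[int, int]]:
    """Split a hash of `hash_bits` bits into `bit_diff + 1` contiguous blocks.

    By the pigeonhole principle, two fingerprints differing in at most
    `bit_diff` bits must be identical in at least one block.
    """
    n_blocks = bit_diff + 1
    base, extra = divmod(hash_bits, n_blocks)
    masks, offset = [], 0
    for i in range(n_blocks):
        width = base + (1 if i < extra else 0)
        masks.append((offset, (1 << width) - 1))  # (shift, mask) per block
        offset += width
    return masks

def bucket_keys(fingerprint: int, masks: list[tuple[int, int]]) -> list[tuple[int, int]]:
    # Pair each block index with its extracted bits so that equal bit
    # patterns at different positions never collide across blocks.
    return [(i, (fingerprint >> shift) & mask)
            for i, (shift, mask) in enumerate(masks)]
```

Any two documents whose fingerprints are within `bit_diff` bits of each other are guaranteed to share at least one bucket key, so candidate generation never misses a true near-duplicate pair (it only produces extra candidates to check).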
Step 2: Data Loading
Load the dataset using the unified data I/O layer from local files or HuggingFace datasets. Each document receives an internal index column. Unlike MinHash, SimHash does not pre-filter short documents; the fingerprint computation handles all document lengths.
Key considerations:
- Same data loading abstraction as all other algorithms
- Internal index column added for cluster tracking
- No length-based pre-filtering is applied
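The index column's role can be shown with a minimal stdlib sketch (the column name `__index__` is a hypothetical choice, not the library's actual name): each record gets a stable integer identity so cluster assignments survive later shuffling and filtering.

```python
def add_index_column(records: list[dict], column: str = "__index__") -> list[dict]:
    """Attach a stable internal index to each record for cluster tracking."""
    return [{**rec, column: i} for i, rec in enumerate(records)]
```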
Step 3: SimHash Fingerprinting
Compute SimHash fingerprints for each document using parallel map operations. Each document is tokenized into n-grams, and a weighted bit vector is accumulated by hashing each n-gram and adding/subtracting from dimension accumulators. The final sign of each dimension produces the binary fingerprint. The fingerprint is then split into bucket keys using bit permutations.
Key considerations:
- N-gram tokenization determines the feature space
- Hash function maps each n-gram to a bit vector contribution
- Output includes bucket keys and fingerprint values for the clustering step
Step 4: Bit-Difference Bucketing and Union-Find Clustering
Iterate through all documents grouped by their bucket keys. Within each bucket, compare the Hamming distance between fingerprint pairs. When two documents have a Hamming distance at or below the bit_diff threshold, they are merged in a Union-Find data structure. After processing all buckets, extract connected components to produce cluster assignments. Only non-trivial clusters (where a document maps to a different representative) are retained.
Key considerations:
- Bucketing reduces the comparison space from O(n^2) to manageable bucket sizes
- Union-Find provides efficient transitive closure of duplicate relationships
- The sequential iteration is necessary because Union-Find is stateful
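A minimal sketch of the bucketed clustering pass, with a hypothetical `cluster` helper and a basic Union-Find (the library's implementation differs in detail):

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)  # lowest index is representative

def cluster(fingerprints: dict[int, int], keys: dict[int, list], bit_diff: int) -> dict[int, int]:
    """fingerprints: doc index -> SimHash; keys: doc index -> bucket keys."""
    buckets = defaultdict(list)
    for idx, ks in keys.items():
        for k in ks:
            buckets[k].append(idx)
    uf = UnionFind()
    for members in buckets.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:  # pairwise check within the bucket only
                if bin(fingerprints[a] ^ fingerprints[b]).count("1") <= bit_diff:
                    uf.union(a, b)
    # Keep only non-trivial assignments: doc maps to a different representative.
    return {i: uf.find(i) for i in fingerprints if uf.find(i) != i}
```

Keeping only non-trivial entries means the returned mapping doubles as the duplicate set: every key is a document to drop, every value its retained representative.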
Step 5: False Positive Verification
Optionally verify candidate clusters by computing actual Jaccard similarity on n-gram token sets. For each cluster, all pairs within the cluster are compared. Verified pairs are re-merged using a fresh Union-Find to produce refined cluster assignments. Statistics on false positives and true positives are logged.
Key considerations:
- Optional step controlled by check_false_positive flag
- Uses Jaccard similarity with a configurable jaccard_threshold
- Rebuilds clusters from scratch using only verified pairs
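The verification step can be sketched as below, assuming word n-gram sets for the Jaccard computation (hypothetical helper names; the library's tokenization is configurable). The returned verified pairs are what a fresh Union-Find would be rebuilt from.

```python
def jaccard(a: str, b: str, n: int = 3) -> float:
    """Exact Jaccard similarity over word n-gram sets."""
    def grams(t: str) -> set:
        toks = t.split()
        return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def verify_cluster(docs: dict[int, str], members: list[int], threshold: float) -> list[tuple[int, int]]:
    """Return pairs inside a candidate cluster whose true Jaccard similarity
    meets the threshold; pairs failing the check are counted as false
    positives and dropped from the rebuilt clusters."""
    return [
        (a, b)
        for i, a in enumerate(members)
        for b in members[i + 1:]
        if jaccard(docs[a], docs[b]) >= threshold
    ]
```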
Step 6: Duplicate Removal and Output
Filter the dataset to remove documents marked as duplicates (keeping only non-duplicate representatives). Save the deduplicated dataset and optionally export cluster assignments. Clean up cache files if configured.
Key considerations:
- Filtering removes documents flagged as duplicates (not cluster representatives)
- Output format is HuggingFace Dataset saved to disk
- Cluster metadata can be preserved for downstream analysis
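The final filter follows directly from the cluster-assignment convention: since only non-trivial assignments are retained, a record is a duplicate exactly when its index appears as a key. A minimal sketch (column name `__index__` is a hypothetical choice):

```python
def remove_duplicates(records: list[dict], assignments: dict[int, int],
                      index_col: str = "__index__") -> list[dict]:
    """Drop records mapped to a different representative; representatives
    themselves never appear as keys in `assignments`, so they are kept."""
    return [r for r in records if r[index_col] not in assignments]
```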