Implementation:ChenghaoMou Text dedup SimHash Check False Positives
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Verification |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by text-dedup, that verifies SimHash candidate pairs by computing exact Jaccard similarity in a pure-Python pairwise comparison.
Description
The check_false_positives function in simhash.py verifies candidate pairs by: (1) filtering to documents marked __duplicate__, (2) grouping by cluster into a Python dict, (3) iterating all pairs within each cluster using nested loops with tqdm, (4) computing Jaccard similarity via the n-gram tokenizer and jaccard_similarity function, (5) collecting verified pairs, and (6) re-clustering with UnionFind to produce refined cluster assignments.
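The six steps above can be sketched in plain Python. This is a minimal illustration of the technique, not the code in simhash.py: the helpers `ngrams`, `jaccard`, `UnionFind`, and `verify_clusters`, the whitespace tokenizer, the `threshold=0.5` default, and the dict-based inputs are all assumptions made for the example. The real function reads its settings from `Config` and operates on a `Dataset`.

```python
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-grams as a set; texts shorter than n yield one shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class UnionFind:
    """Hypothetical stand-in for the library's UnionFind."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # The smaller index becomes the cluster representative.
            self.parent[max(rx, ry)] = min(rx, ry)


def verify_clusters(
    docs: dict[int, str],
    clusters: dict[int, list[int]],
    threshold: float = 0.5,  # assumed value, for illustration only
    n: int = 3,
) -> dict[int, int]:
    """Re-cluster candidates, keeping only pairs that pass the Jaccard check."""
    uf = UnionFind()
    for members in clusters.values():
        shingles = {i: ngrams(docs[i], n) for i in members}
        # Nested pairwise comparison inside each candidate cluster.
        for i, j in combinations(members, 2):
            if jaccard(shingles[i], shingles[j]) >= threshold:
                uf.union(i, j)
    # Only documents that took part in a verified pair appear in the result.
    return {i: uf.find(i) for i in list(uf.parent)}


docs = {
    0: "the quick brown fox jumps over the lazy dog",
    1: "the quick brown fox jumps over the lazy cat",
    2: "completely different text about deduplication pipelines here",
}
clusters = {0: [0, 1, 2]}  # one SimHash candidate cluster
verified = verify_clusters(docs, clusters)
```

In this toy input, documents 0 and 1 share six of eight distinct trigrams and stay together, while document 2 shares none with either and drops out as a false positive. The pairwise loop is quadratic in cluster size, which is why the verification pass is worth running only on the candidate clusters rather than the whole dataset.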
Usage
Import this function when running the SimHash pipeline with check_false_positive=True.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/simhash.py
- Lines: L93-173
Signature
def check_false_positives(
    config: Config,
    ds: Dataset,
) -> tuple[Dataset, dict[int, int]]:
    """Check false positives using Jaccard similarity.

    Parameters
    ----------
    config : Config
        Pipeline configuration with SimHash settings.
    ds : Dataset
        Dataset with __duplicate__ and __CLUSTER__ columns.

    Returns
    -------
    tuple[Dataset, dict[int, int]]
        Updated dataset with refined clusters, and new verified parent mapping.
    """
Import
from text_dedup.simhash import check_false_positives
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Pipeline configuration with SimHash settings |
| ds | Dataset | Yes | Dataset with __duplicate__ flag and cluster assignments |
Outputs
| Name | Type | Description |
|---|---|---|
| ds | Dataset | Updated dataset with refined cluster assignments |
| assignment | dict[int, int] | New verified parent mapping (non-trivial only) |
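To make the second output easier to picture, the sketch below inverts a parent mapping into explicit clusters. It assumes the mapping is keyed by document index with the cluster representative as the value, and that the representative itself may be absent from the keys. Both points are assumptions about the mapping's shape, and `group_by_parent` is a hypothetical helper, not part of text-dedup.

```python
from collections import defaultdict


def group_by_parent(assignment: dict[int, int]) -> dict[int, list[int]]:
    """Group document indices under their cluster representative."""
    groups: dict[int, set[int]] = defaultdict(set)
    for doc_id, parent in assignment.items():
        # Add the parent too, in case it has no entry of its own.
        groups[parent].update((doc_id, parent))
    return {parent: sorted(members) for parent, members in groups.items()}


# Made-up mapping for illustration: docs 1 and 2 point at 0, doc 7 points at 5.
assignment = {1: 0, 2: 0, 7: 5}
groups = group_by_parent(assignment)
```

Each key in `groups` is a representative and each value lists every member of that verified cluster, which is a convenient shape for keeping one document per cluster.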
Usage Examples
Running SimHash Verification
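from typing import cast

from text_dedup.config import SimHashAlgorithmConfig
from text_dedup.simhash import check_false_positives

# `config` (Config) and `ds` (Dataset) are assumed to come from earlier pipeline steps.
algo = cast(SimHashAlgorithmConfig, config.algorithm)
if algo.check_false_positive:
    # Refine clusters by verifying candidate pairs with exact Jaccard similarity.
    ds, assignment = check_false_positives(config, ds)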