Implementation:ChenghaoMou Text dedup MinHash Check False Positives
| Knowledge Sources | |
|---|---|
| Domains | Deduplication, Verification |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, provided by text-dedup, for verifying MinHash LSH candidate pairs with exact Jaccard similarity, computed through a Polars-based pairwise comparison.
Description
The check_false_positives function in minhash.py verifies candidates in six steps: (1) filter the dataset to rows flagged as __duplicate__, (2) convert them to a Polars DataFrame with index, text, and cluster columns, (3) self-join on cluster to generate all pairs within each cluster, (4) compute the Jaccard similarity of each pair via map_elements, using the n-gram tokenizer and the jaccard_similarity function, (5) keep only the pairs above the threshold, and (6) re-assign cluster IDs based on the verified pairs.
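The six steps above can be sketched in plain Python. This is a minimal illustration, not the text-dedup implementation: it uses the standard library instead of Polars, and the names ngrams, jaccard, and verify_clusters, the whitespace word tokenizer, the >= comparison, and the "smallest index becomes the cluster ID" rule are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set:
    """Word n-grams of `text` (hypothetical tokenizer, not text-dedup's)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify_clusters(records, threshold=0.5, n=3):
    """records: (index, text, cluster) tuples for rows already flagged as candidates.

    Returns {index: verified_cluster_id}. Assumption: the smallest index in
    each verified group serves as that group's cluster ID.
    """
    # Steps 1-2: group candidate rows by their LSH cluster.
    by_cluster = defaultdict(list)
    for idx, text, cluster in records:
        by_cluster[cluster].append((idx, ngrams(text, n)))

    parent = {idx: idx for idx, _, _ in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Steps 3-5: all pairs within a cluster, exact Jaccard, threshold filter.
    for members in by_cluster.values():
        for (i, gi), (j, gj) in combinations(members, 2):
            if jaccard(gi, gj) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Step 6: re-assign cluster IDs from the verified pairs only.
    return {idx: find(idx) for idx in parent}


records = [
    (0, "the quick brown fox jumps over the lazy dog", 7),
    (1, "the quick brown fox jumps over the lazy cat", 7),
    (2, "completely unrelated sentence about deduplication pipelines here", 7),
]
result = verify_clusters(records, threshold=0.5)
```

All three rows share LSH cluster 7, but only rows 0 and 1 pass the exact check (they share 6 of 8 distinct trigrams), so row 2 ends up alone in its own cluster. The pairwise loop is quadratic in cluster size, which is why this kind of verification is applied only to rows already flagged as candidates.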
Usage
Import this function when running the MinHash pipeline with check_false_positive=True in the configuration.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/minhash.py
- Lines: L94-163
Signature
def check_false_positives(
    config: Config,
    ds: Dataset,
) -> tuple[Dataset, dict[int, int]]:
    """Check false positives using exact Jaccard similarity.

    Parameters
    ----------
    config : Config
        Pipeline configuration with MinHash algorithm settings.
    ds : Dataset
        Dataset with __duplicate__ and __CLUSTER__ columns.

    Returns
    -------
    tuple[Dataset, dict[int, int]]
        Updated dataset with refined clusters, and new verified parent mapping.
    """
Import
from text_dedup.minhash import check_false_positives
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Pipeline configuration with MinHash settings |
| ds | Dataset | Yes | Dataset with __duplicate__ flag and __CLUSTER__ column |
Outputs
| Name | Type | Description |
|---|---|---|
| Dataset | Dataset | Updated dataset with refined cluster assignments |
| dict[int, int] | dict | New verified parent mapping (document_index → cluster_id) |
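To make the shape of the returned mapping concrete, the sketch below inverts a document_index → cluster_id dictionary into cluster membership lists. The helper group_by_cluster and the sample values are hypothetical and exist only for illustration; they are not part of text-dedup.

```python
from collections import defaultdict


def group_by_cluster(assignment: dict) -> dict:
    """Invert {document_index: cluster_id} into {cluster_id: [document_index, ...]}."""
    clusters = defaultdict(list)
    for doc_index, cluster_id in assignment.items():
        clusters[cluster_id].append(doc_index)
    return {cid: sorted(members) for cid, members in clusters.items()}


# Made-up mapping in the documented shape (document_index -> cluster_id).
assignment = {0: 0, 1: 0, 2: 2, 5: 2}
clusters = group_by_cluster(assignment)
# clusters == {0: [0, 1], 2: [2, 5]}
```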
Usage Examples
Running Verification
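from typing import cast

from text_dedup.config import MinHashAlgorithmConfig
from text_dedup.minhash import check_false_positives

# `config` and `ds` come from the earlier pipeline stages; `ds` must
# already carry the __duplicate__ and __CLUSTER__ columns.
algo = cast(MinHashAlgorithmConfig, config.algorithm)
if algo.check_false_positive:
    # Re-check candidate pairs with exact Jaccard similarity.
    ds, assignment = check_false_positives(config, ds)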