Implementation:ChenghaoMou Text dedup SimHash Check False Positives

From Leeroopedia
Knowledge Sources
Domains: Deduplication, Verification
Last Updated: 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by text-dedup, that verifies SimHash candidate pairs by computing their exact Jaccard similarity in a pure-Python pairwise comparison.
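To make the verification metric concrete, here is a minimal sketch of Jaccard similarity over word n-grams. The helper names word_ngrams and jaccard, the whitespace tokenization, the n-gram size of 3, and the handling of empty inputs are all assumptions made for illustration; text-dedup's own tokenizer and jaccard_similarity function may behave differently.

```python
def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Assumption: a short text collapses to one shingle so it stays comparable.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Compute |A ∩ B| / |A ∪ B|; two empty sets score 0.0 by assumption."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

# Each document has 7 trigrams; 4 are shared, so the union holds 10.
score = jaccard(word_ngrams(doc_a), word_ngrams(doc_b))
print(f"{score:.2f}")  # 0.40
```

A single changed word removes three of the seven trigrams from the overlap, which is why exact Jaccard is a stricter check than the SimHash fingerprint match that produced the candidate pair.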

Description

The check_false_positives function in simhash.py verifies candidate pairs by: (1) filtering to documents marked __duplicate__, (2) grouping by cluster into a Python dict, (3) iterating all pairs within each cluster using nested loops with tqdm, (4) computing Jaccard similarity via the n-gram tokenizer and jaccard_similarity function, (5) collecting verified pairs, and (6) re-clustering with UnionFind to produce refined cluster assignments.
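The six steps above can be sketched in plain Python as follows. This is an illustrative reconstruction, not the code in simhash.py: the record keys (idx, cluster, duplicate, tokens), the verify_clusters name, the threshold value, and the small UnionFind class are hypothetical, and the sketch assumes the duplicate flag is set on every member of a candidate cluster. itertools.combinations stands in for the nested loops, and the tqdm progress bar is omitted.

```python
from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Minimal union-find; the smaller index becomes the root."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify_clusters(records: list[dict], threshold: float = 0.5) -> dict[int, int]:
    """Return a refined idx -> cluster-root mapping for all records."""
    # Steps 1-2: keep flagged documents and group them by candidate cluster.
    groups: dict[int, list[dict]] = defaultdict(list)
    for rec in records:
        if rec["duplicate"]:
            groups[rec["cluster"]].append(rec)

    # Steps 3-5: compare every pair inside a cluster, keep the verified ones.
    uf = UnionFind()
    for members in groups.values():
        for left, right in combinations(members, 2):
            if jaccard(left["tokens"], right["tokens"]) >= threshold:
                uf.union(left["idx"], right["idx"])  # Step 6: re-cluster

    return {rec["idx"]: uf.find(rec["idx"]) for rec in records}


records = [
    {"idx": 0, "cluster": 0, "duplicate": True, "tokens": {"a", "b", "c"}},
    {"idx": 1, "cluster": 0, "duplicate": True, "tokens": {"a", "b", "c", "d"}},
    {"idx": 2, "cluster": 0, "duplicate": True, "tokens": {"x", "y", "z"}},
    {"idx": 3, "cluster": 3, "duplicate": False, "tokens": {"q"}},
]
refined = verify_clusters(records)
print(refined)  # {0: 0, 1: 0, 2: 2, 3: 3}
```

Document 2 shared a candidate cluster with documents 0 and 1 but has no tokens in common with them, so verification splits it back out as a false positive. Because every pair within a cluster is compared, the cost grows quadratically with cluster size.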

Usage

Import this function when running the SimHash pipeline with check_false_positive=True.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/simhash.py
  • Lines: L93-173

Signature

def check_false_positives(
    config: Config,
    ds: Dataset,
) -> tuple[Dataset, dict[int, int]]:
    """Check false positives using Jaccard similarity.

    Parameters
    ----------
    config : Config
        Pipeline configuration with SimHash settings.
    ds : Dataset
        Dataset with __duplicate__ and __CLUSTER__ columns.

    Returns
    -------
    tuple[Dataset, dict[int, int]]
        Updated dataset with refined clusters, and new verified parent mapping.
    """

Import

from text_dedup.simhash import check_false_positives

I/O Contract

Inputs

Name | Type | Required | Description
config | Config | Yes | Pipeline configuration with SimHash settings
ds | Dataset | Yes | Dataset with __duplicate__ flag and cluster assignments

Outputs

Name | Type | Description
Dataset | Dataset | Updated dataset with refined cluster assignments
dict[int, int] | dict | New verified parent mapping (non-trivial only)
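One way to consume a parent mapping that holds only non-trivial entries is sketched below. It assumes, without confirmation from the source, that keys are document indices, values are parent indices, and documents that are their own root are simply absent. The names resolve_root and parents, and the sample values, are hypothetical.

```python
def resolve_root(idx: int, parents: dict[int, int]) -> int:
    """Follow parent links until reaching an index with no non-trivial parent."""
    while idx in parents and parents[idx] != idx:
        idx = parents[idx]
    return idx


# Hypothetical mapping: document 1 points at 0, document 4 points at 1.
parents = {1: 0, 4: 1}

# Indices missing from the mapping are treated as their own cluster root.
roots = {i: resolve_root(i, parents) for i in range(5)}
print(roots)  # {0: 0, 1: 0, 2: 2, 3: 3, 4: 0}

# Keeping one representative per cluster means keeping only the roots.
kept = [i for i, root in roots.items() if i == root]
print(kept)  # [0, 2, 3]
```

Following the chain rather than reading a single entry keeps the lookup correct whether or not the mapping has been flattened to point directly at roots.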

Usage Examples

Running SimHash Verification

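from typing import cast

from text_dedup.config import SimHashAlgorithmConfig
from text_dedup.simhash import check_false_positives

# `config` (Config) and `ds` (Dataset) are assumed to come from the earlier
# SimHash pipeline stages; they are not constructed here.
algo = cast(SimHashAlgorithmConfig, config.algorithm)
if algo.check_false_positive:
    # Re-verify candidate clusters with exact Jaccard similarity.
    ds, assignment = check_false_positives(config, ds)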

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
