Implementation:ChenghaoMou Text dedup MinHash Check False Positives

From Leeroopedia
Knowledge Sources
Domains Deduplication, Verification
Last Updated 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by text-dedup, that verifies MinHash LSH candidate pairs by computing their exact Jaccard similarity in a Polars-based pairwise comparison.

Description

The check_false_positives function in minhash.py verifies candidates in six steps: (1) it filters the dataset to the rows marked as __duplicate__, (2) converts them to a Polars DataFrame with index, text, and cluster columns, (3) self-joins that frame on cluster to generate every pair of documents sharing a cluster, (4) computes the Jaccard similarity of each pair via map_elements, using the n-gram tokenizer and the jaccard_similarity function, (5) keeps only the pairs whose similarity is above the threshold, and (6) re-assigns cluster IDs based on the verified pairs.
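The steps above can be sketched in plain Python. This is an illustrative simplification, not the text-dedup implementation: it swaps the Polars self-join for itertools.combinations and uses a small union-find to re-assign cluster IDs. The helper names (ngrams, jaccard, verify_clusters), the word-level trigram tokenizer, the >= comparison, and the sample rows are all assumptions made for the example.

```python
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set[str]:
    """Word-level n-grams as a set (hypothetical tokenizer for this sketch)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Texts shorter than n tokens become a single shingle.
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A & B| / |A | B|; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify_clusters(rows, threshold: float = 0.5, n: int = 3) -> dict[int, int]:
    """rows: iterable of (index, text, cluster). Returns index -> verified cluster id."""
    rows = list(rows)
    parent = {idx: idx for idx, _, _ in rows}

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group candidates by their LSH cluster (stands in for the self-join on cluster).
    by_cluster: dict[int, list[tuple[int, set[str]]]] = {}
    for idx, text, cluster in rows:
        by_cluster.setdefault(cluster, []).append((idx, ngrams(text, n)))

    # Compare every pair inside a cluster and merge only the verified pairs.
    for members in by_cluster.values():
        for (i, gi), (j, gj) in combinations(members, 2):
            if jaccard(gi, gj) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    return {idx: find(idx) for idx in parent}


# Made-up candidates: documents 0-2 share LSH cluster 7, document 3 is in cluster 9.
rows = [
    (0, "the quick brown fox jumps over the lazy dog", 7),
    (1, "the quick brown fox jumps over the lazy cat", 7),
    (2, "completely unrelated sentence about polars dataframes here", 7),
    (3, "another cluster entirely", 9),
]
assignment = verify_clusters(rows, threshold=0.5)
print(assignment)  # {0: 0, 1: 0, 2: 2, 3: 3}
```

Documents 0 and 1 share six of eight distinct trigrams (Jaccard 0.75), so they stay together, while document 2 is split out of the cluster as a false positive. Comparing all pairs is quadratic in cluster size, which is why verification is restricted to rows already flagged as candidates.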

Usage

Call this function as part of the MinHash pipeline when check_false_positive=True is set in the algorithm configuration; without that flag the verification step is skipped.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/minhash.py
  • Lines: L94-163

Signature

def check_false_positives(
    config: Config,
    ds: Dataset,
) -> tuple[Dataset, dict[int, int]]:
    """Check false positives using exact Jaccard similarity.

    Parameters
    ----------
    config : Config
        Pipeline configuration with MinHash algorithm settings.
    ds : Dataset
        Dataset with __duplicate__ and __CLUSTER__ columns.

    Returns
    -------
    tuple[Dataset, dict[int, int]]
        Updated dataset with refined clusters, and new verified parent mapping.
    """

Import

from text_dedup.minhash import check_false_positives

I/O Contract

Inputs

Name Type Required Description
config Config Yes Pipeline configuration with MinHash settings
ds Dataset Yes Dataset with __duplicate__ flag and __CLUSTER__ column

Outputs

Name Type Description
Dataset Dataset Updated dataset with refined cluster assignments
dict[int, int] dict New verified parent mapping (document_index → cluster_id)
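To make the second output concrete, the sketch below inverts a parent mapping of the documented shape (document_index → cluster_id) into per-cluster member lists. The function name group_by_cluster and the sample mapping are hypothetical; only the dict[int, int] shape is taken from the contract above.

```python
from collections import defaultdict


def group_by_cluster(assignment: dict[int, int]) -> dict[int, list[int]]:
    """Invert document_index -> cluster_id into cluster_id -> sorted member indices."""
    clusters: dict[int, list[int]] = defaultdict(list)
    for doc_index, cluster_id in assignment.items():
        clusters[cluster_id].append(doc_index)
    return {cid: sorted(members) for cid, members in clusters.items()}


# Made-up mapping in the shape returned as the second output.
sample = {0: 0, 1: 0, 2: 2, 5: 2, 7: 7}
print(group_by_cluster(sample))  # {0: [0, 1], 2: [2, 5], 7: [7]}
```

Clusters with a single member correspond to documents that no longer have a verified duplicate.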

Usage Examples

Running Verification

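from typing import cast

from text_dedup.config import MinHashAlgorithmConfig
from text_dedup.minhash import check_false_positives

# `config` (Config) and `ds` (Dataset) are assumed to come from earlier pipeline
# steps; `ds` must already carry the __duplicate__ and __CLUSTER__ columns.
algo = cast(MinHashAlgorithmConfig, config.algorithm)

# Run exact-Jaccard verification only when it is enabled in the configuration.
if algo.check_false_positive:
    ds, assignment = check_false_positives(config, ds)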
