Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Heuristic:ChenghaoMou Text dedup False Positive Verification Tradeoff

From Leeroopedia
Knowledge Sources
Domains Optimization, Text_Deduplication, Precision_Recall
Last Updated 2026-02-14 21:00 GMT

Overview

False positive verification (`check_false_positive=False` by default) is an expensive optional post-processing step that validates LSH duplicate candidates using exact Jaccard similarity.

Description

Both MinHash LSH and SimHash use locality-sensitive hashing to find candidate duplicate pairs. These candidates may include false positives (pairs flagged as similar that are not actually duplicates). The optional verification step computes exact Jaccard similarity for all candidate pairs within each cluster and removes pairs below the threshold. This step is disabled by default because it requires O(n^2) pairwise comparisons within each cluster, which can be extremely expensive on large datasets.

Usage

Enable `check_false_positive=True` when precision is critical and the dataset is small to medium-sized. Keep it disabled (default) for large-scale deduplication where the LSH approximation is sufficient. The verification step is most useful when tuning deduplication parameters on a sample dataset to understand the false positive rate.

The Insight (Rule of Thumb)

  • Action: Leave `check_false_positive=False` for production runs; enable for parameter tuning or quality-critical tasks.
  • Value: Default threshold is `jaccard_threshold=0.5` for SimHash, and uses the same `threshold` parameter for MinHash.
  • Trade-off: Verification can reduce false positives significantly but adds O(n^2) computation per cluster. On large datasets with many clusters, this becomes a bottleneck.
  • Diagnostic output: When enabled, the verification step logs the number of verified clusters/pairs, false positives, true positives, and true clusters.

Reasoning

LSH is an approximation algorithm by design. The band/row configuration in MinHash and the permutation strategy in SimHash both trade precision for speed. False positive weight (default 0.5) and false negative weight (default 0.5) in MinHash's `optimal_param()` function balance these error types equally. The verification step provides a way to recover precision at the cost of additional computation time.

The MinHash implementation uses Polars `map_elements` for pairwise Jaccard computation within clusters, while SimHash uses a pure Python approach with `tqdm` progress tracking. Both re-cluster verified pairs using UnionFind.

Code Evidence

Default configuration from `src/text_dedup/config/algorithms/minhash.py:91`:

check_false_positive: bool = False

Default configuration from `src/text_dedup/config/algorithms/simhash.py:263`:

check_false_positive: bool = False

Conditional verification in MinHash from `src/text_dedup/minhash.py:206-208`:

with timer("Verifying"):
    if algo.check_false_positive:
        ds, assignment = check_false_positives(config, ds)

Diagnostic logging from `src/text_dedup/minhash.py:145-148`:

log.info(f"Verified {cluster_num} clusters/{verified_pairs} pairs")
log.info(f"False Positives   : {total_false_positives}")
log.info(f"True Positives    : {total_true_positives}")
log.info(f"True Clusters     : {total_true_positive_clusters}")

Optimal parameter balancing from `src/text_dedup/config/algorithms/minhash.py:18-22`:

def optimal_param(
    threshold: float,
    num_perm: int,
    false_positive_weight: float = 0.5,
    false_negative_weight: float = 0.5,
) -> tuple[int, int]:

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment