Principle: ChenghaoMou text-dedup Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A systematic evaluation methodology that measures deduplication algorithm quality using pairwise precision/recall/F1 on labeled datasets and clustering quality via adjusted Rand index.
Description
Benchmark Evaluation provides a rigorous framework for comparing deduplication algorithms against ground truth. Two evaluation methodologies are used:
CORE dataset evaluation (pairwise): For each document, ground truth specifies which other documents are its duplicates. The algorithm's predictions are classified per document:
- True Positive (TP): the document has ground-truth duplicates and the predicted duplicates contain all of them.
- False Positive (FP): the document has no ground-truth duplicates, but some are predicted.
- False Negative (FN): the document has ground-truth duplicates, but the prediction misses some of them.
- True Negative (TN): the document has no ground-truth duplicates and none are predicted.

Precision, recall, and macro F1 are computed from these per-document classifications.
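A minimal Python sketch of this per-document scheme, assuming a toy corpus; the helper names (`classify`, `evaluate`) and the data are illustrative, not the benchmark's actual code:

```python
def classify(gt_dups: set, pred_dups: set) -> str:
    """Label one document's prediction against its ground truth."""
    if gt_dups:
        # TP only if the prediction contains all ground-truth duplicates
        return "TP" if gt_dups <= pred_dups else "FN"
    return "FP" if pred_dups else "TN"

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def evaluate(ground_truth: dict, predictions: dict):
    """Classify every document, then report precision/recall for the duplicate
    class and macro F1 averaged over the duplicate and non-duplicate classes."""
    c = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for doc, gt_dups in ground_truth.items():
        c[classify(gt_dups, predictions.get(doc, set()))] += 1
    precision = c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else 0.0
    recall = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else 0.0
    # Macro F1: for the non-duplicate class, TN plays the role of TP,
    # FN that of FP (predicted non-duplicate, actually duplicate), FP that of FN.
    macro = (f1_score(c["TP"], c["FP"], c["FN"]) + f1_score(c["TN"], c["FN"], c["FP"])) / 2
    return precision, recall, macro

# Toy corpus: a and b duplicate each other; c and d are unique.
gt = {"a": {"b"}, "b": {"a"}, "c": set(), "d": set()}
pred = {"a": {"b"}, "b": {"a"}, "c": {"a"}, "d": set()}  # c wrongly flagged
precision, recall, macro = evaluate(gt, pred)
```

On this toy corpus the counts are two TPs, one FP, and one TN, giving precision 2/3, recall 1.0, and macro F1 11/15.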
NEWS-COPY dataset evaluation (clustering): Ground truth provides cluster labels. The Adjusted Rand Index (ARI) measures agreement between predicted and ground truth clusterings, adjusted for chance.
Usage
Use this principle when evaluating deduplication algorithm quality on labeled benchmark datasets.
Theoretical Basis
Pairwise classification for CORE:
```
# Abstract evaluation logic (NOT real implementation)
for each document:
    gt_dups = ground_truth_duplicates(document)
    pred_dups = predicted_duplicates(document)
    classification = classify(gt_dups, pred_dups)
    # TP: has dups AND predicted correctly
    # FP: no dups BUT predicted some
    # FN: has dups BUT missed
    # TN: no dups AND predicted none
```
Adjusted Rand Index:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

where RI is the Rand Index measuring agreement between two clusterings, E[RI] is its expected value under random label assignment, and max(RI) is its maximum possible value. ARI is 1 for identical clusterings and close to 0 for random ones.
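The adjusted index can be computed from the pair-counting contingency table of the two labelings. A from-scratch sketch of the standard formula follows; in practice an evaluation would more likely call `sklearn.metrics.adjusted_rand_score`:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred) -> float:
    """ARI = (index - expected) / (max_index - expected), where all three
    quantities count pairs of items that fall in the same group."""
    n = len(labels_true)
    # Pairs together in the same cell of the contingency table
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs together under each labeling separately
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate: e.g. one cluster, or all singletons
        return 1.0
    return (index - expected) / (max_index - expected)

# Identical clusterings (up to label names) score exactly 1.0
score = adjusted_rand_index([0, 0, 1, 1], ["a", "a", "b", "b"])
```

Note that ARI is invariant to label permutation: only the grouping matters, not the cluster names, which is why predicted cluster IDs can be compared directly against ground-truth IDs.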