Principle: ChenghaoMou text-dedup Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A systematic evaluation methodology that measures deduplication algorithm quality using pairwise precision/recall/F1 on labeled datasets and clustering quality via adjusted Rand index.
Description
Benchmark Evaluation provides a rigorous framework for comparing deduplication algorithms against ground truth. Two evaluation methodologies are used:
CORE dataset evaluation (pairwise): For each document, ground truth specifies which other documents are its duplicates. The algorithm's predictions are classified per document:
- True Positive (TP): the document has ground-truth duplicates and the predicted duplicates contain all of them.
- False Positive (FP): the document has no ground-truth duplicates, but some are predicted.
- False Negative (FN): the document has ground-truth duplicates, but the prediction misses some of them.
- True Negative (TN): the document has no ground-truth duplicates and none are predicted.

Precision, recall, and macro F1 are computed from these per-document classifications.
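A minimal Python sketch of this per-document scheme, assuming a toy corpus; the helper names (`classify`, `evaluate`) and the data are illustrative, not the benchmark's actual code:

```python
def classify(gt_dups: set, pred_dups: set) -> str:
    """Label one document's prediction against its ground truth."""
    if gt_dups:
        # TP only if the prediction contains all ground-truth duplicates
        return "TP" if gt_dups <= pred_dups else "FN"
    return "FP" if pred_dups else "TN"

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def evaluate(ground_truth: dict, predictions: dict):
    """Classify every document, then report precision/recall for the duplicate
    class and macro F1 averaged over the duplicate and non-duplicate classes."""
    c = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for doc, gt_dups in ground_truth.items():
        c[classify(gt_dups, predictions.get(doc, set()))] += 1
    precision = c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else 0.0
    recall = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else 0.0
    # Macro F1: for the non-duplicate class, TN plays the role of TP,
    # FN that of FP (predicted non-duplicate, actually duplicate), FP that of FN.
    macro = (f1_score(c["TP"], c["FP"], c["FN"]) + f1_score(c["TN"], c["FN"], c["FP"])) / 2
    return precision, recall, macro

# Toy corpus: a and b duplicate each other; c and d are unique.
gt = {"a": {"b"}, "b": {"a"}, "c": set(), "d": set()}
pred = {"a": {"b"}, "b": {"a"}, "c": {"a"}, "d": set()}  # c wrongly flagged
precision, recall, macro = evaluate(gt, pred)
```

On this toy corpus the counts are two TPs, one FP, and one TN, giving precision 2/3, recall 1.0, and macro F1 11/15.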
NEWS-COPY dataset evaluation (clustering): Ground truth provides cluster labels. The Adjusted Rand Index (ARI) measures agreement between predicted and ground truth clusterings, adjusted for chance.
Usage
Use this principle when evaluating deduplication algorithm quality on labeled benchmark datasets.
Theoretical Basis
Pairwise classification for CORE:
```
# Abstract evaluation logic (NOT real implementation)
for each document:
    gt_dups = ground_truth_duplicates(document)
    pred_dups = predicted_duplicates(document)
    classification = classify(gt_dups, pred_dups)
    # TP: has dups AND predicted correctly
    # FP: no dups BUT predicted some
    # FN: has dups BUT missed
    # TN: no dups AND predicted none
```
Adjusted Rand Index:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

where RI is the Rand Index measuring agreement between two clusterings, E[RI] is its expected value under random label assignment, and max(RI) is its maximum possible value. ARI is 1 for identical clusterings and close to 0 for random ones.
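The adjusted index can be computed from the pair-counting contingency table of the two labelings. A from-scratch sketch of the standard formula follows; in practice an evaluation would more likely call `sklearn.metrics.adjusted_rand_score`:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred) -> float:
    """ARI = (index - expected) / (max_index - expected), where all three
    quantities count pairs of items that fall in the same group."""
    n = len(labels_true)
    # Pairs together in the same cell of the contingency table
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs together under each labeling separately
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate: e.g. one cluster, or all singletons
        return 1.0
    return (index - expected) / (max_index - expected)

# Identical clusterings (up to label names) score exactly 1.0
score = adjusted_rand_index([0, 0, 1, 1], ["a", "a", "b", "b"])
```

Note that ARI is invariant to label permutation: only the grouping matters, not the cluster names, which is why predicted cluster IDs can be compared directly against ground-truth IDs.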