Workflow: ChenghaoMou text-dedup Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Benchmarking |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for evaluating deduplication algorithm quality on standard benchmark datasets (CORE and NEWS-COPY) with precision, recall, F1, and Adjusted Rand Index metrics.
Description
This workflow evaluates the MinHash and SimHash deduplication algorithms against two established benchmark datasets. The CORE benchmark (pinecone/core-2020-05-10-deduplication) evaluates pairwise duplicate detection on academic papers using precision, recall, macro F1, and accuracy. The NEWS-COPY benchmark (chenghao/NEWS-COPY-eval) evaluates clustering quality on news articles using Adjusted Rand Index (ARI). The benchmark runner loads datasets from HuggingFace Hub, prepares ground truth labels, runs the deduplication algorithms with benchmark-specific TOML configurations, loads the resulting clusters, and computes evaluation metrics.
Goal: Evaluation metrics (precision, recall, F1, ARI) and timing results for deduplication algorithms on standardized datasets.
Scope: From benchmark dataset loading through algorithm execution, cluster extraction, metric computation, and formatted result display.
Strategy: Uses a unified CLI runner that dispatches to dataset-specific benchmark functions, each of which runs the actual deduplication algorithm and evaluates the resulting clusters against ground truth.
Usage
Execute this workflow when you want to evaluate deduplication algorithm performance, compare algorithm configurations, or reproduce the published benchmark results. Use it after tuning algorithm hyperparameters (e.g., threshold, num_perm, bit_diff) to measure the impact on quality metrics. The benchmarks can be run individually (CORE-only, NEWS-only) or together, and support selecting specific algorithms (MinHash, SimHash, or both).
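The CLI surface described above can be sketched with argparse. The flag names and defaults here are assumptions for illustration; the actual runner may use a different CLI library and different option spellings.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flag names; the real runner may spell these differently.
    parser = argparse.ArgumentParser(description="Run dedup benchmark evaluation")
    parser.add_argument("--dataset", choices=["core", "news", "all"], default="all",
                        help="Benchmark dataset(s) to evaluate")
    parser.add_argument("--algorithm", choices=["minhash", "simhash", "all"], default="all",
                        help="Deduplication algorithm(s) to run")
    return parser

# Expand "all" into the concrete list before dispatching to the
# dataset-specific benchmark functions.
args = build_parser().parse_args(["--dataset", "core", "--algorithm", "minhash"])
datasets = ["core", "news"] if args.dataset == "all" else [args.dataset]
algorithms = ["minhash", "simhash"] if args.algorithm == "all" else [args.algorithm]
```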
Execution Steps
Step 1: Benchmark Configuration
Select the benchmark dataset(s) and algorithm(s) to evaluate via CLI arguments. Load the appropriate TOML configuration files from the configs/ directory. Each benchmark has dedicated config files (e.g., benchmark_core_minhash.toml, benchmark_news_simhash.toml) with parameters tuned for the specific dataset. The config is parsed into a Config object using the same config system as the main deduplication pipelines.
Key considerations:
- CLI accepts dataset (core, news, all) and algorithm (minhash, simhash, all) options
- Benchmark configs are separate from the main deduplication configs
- Each config includes benchmark-specific output directories and parameter tuning
Step 2: Dataset Loading and Ground Truth Preparation
Load the benchmark dataset from HuggingFace Hub and prepare ground truth labels. For CORE, extract the labelled_duplicates field and build a mapping from internal IDs to core_ids and a dictionary of known duplicate pairs. For NEWS-COPY, extract cluster labels and apply text preprocessing (news-specific normalization). The preprocessed dataset is temporarily saved to disk for the deduplication algorithm to load.
Key considerations:
- CORE uses pairwise duplicate labels; NEWS-COPY uses cluster assignments
- Text preprocessing is applied (e.g., lowercasing and concatenating title + abstract for CORE, news-specific cleanup for NEWS-COPY)
- Progress bars are temporarily disabled during dataset loading
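For the CORE side, the ground-truth preparation amounts to turning per-record duplicate lists into a labelled set of unordered pairs. A minimal sketch on toy data, assuming the `labelled_duplicates` field holds the core_ids of each record's known duplicates and that the annotations are symmetric:

```python
from itertools import combinations

# Toy rows mimicking the CORE schema.
rows = [
    {"core_id": "a", "labelled_duplicates": ["b"]},
    {"core_id": "b", "labelled_duplicates": ["a"]},
    {"core_id": "c", "labelled_duplicates": []},
]

# Map each core_id to its known duplicate set, then enumerate every
# unordered pair with a binary duplicate/non-duplicate label.
dup_map = {r["core_id"]: set(r["labelled_duplicates"]) for r in rows}
labels = {
    frozenset(pair): pair[1] in dup_map[pair[0]]
    for pair in combinations(sorted(dup_map), 2)
}
```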
Step 3: Algorithm Execution
Run the selected deduplication algorithm(s) using the benchmark configuration. The algorithm's main() function is called directly (minhash_main or simhash_main), executing the full deduplication pipeline. The preprocessed benchmark dataset is saved to a temporary location and the config is adjusted to load from it. Cluster results are saved to the configured output directory.
Key considerations:
- The actual deduplication algorithms are invoked programmatically, not via CLI
- Temporary dataset storage handles the preprocessed benchmark data
- Execution time is measured using the Timer utility
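The programmatic invocation pattern can be sketched as below. The `timer` context manager stands in for the project's Timer utility (interface assumed), and `minhash_main` is a placeholder for the real entry point:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(times: dict, name: str):
    # Stand-in for the project's Timer utility (assumed interface):
    # records elapsed wall-clock seconds under the given name.
    start = time.perf_counter()
    yield
    times[name] = time.perf_counter() - start

def minhash_main(config: dict) -> dict:
    # Placeholder for the real MinHash entry point, which would run the
    # full deduplication pipeline and write clusters to the output dir.
    return {"clusters_written_to": config["output_dir"]}

times: dict = {}
config = {"output_dir": "output/benchmark_core_minhash"}
with timer(times, "MinHash"):
    result = minhash_main(config)
```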
Step 4: Cluster Extraction and Metric Computation
Load the saved cluster assignments from the output directory (pickle format). For CORE, convert internal cluster IDs to core_ids and compute pairwise predictions, then evaluate precision, recall, macro F1 score, and accuracy for both duplicate and non-duplicate classes. For NEWS-COPY, convert cluster assignments to label arrays and compute the Adjusted Rand Index (ARI) against ground truth clusters.
Key considerations:
- Cluster-to-prediction conversion differs between CORE (pairwise) and NEWS-COPY (clustering)
- CORE metrics include per-class precision/recall and macro averages
- ARI measures clustering agreement independent of label permutation
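The cluster-to-pairwise conversion and metric computation for the CORE side can be sketched in pure Python. For brevity this computes precision/recall/F1 for the duplicate class only, whereas the benchmark also reports the non-duplicate class and macro averages; the toy data is illustrative:

```python
from itertools import combinations

def cluster_to_pairs(assignment: dict) -> dict:
    """Convert id -> cluster_id into unordered-pair -> bool predictions."""
    ids = sorted(assignment)
    return {frozenset(p): assignment[p[0]] == assignment[p[1]]
            for p in combinations(ids, 2)}

def pairwise_prf(truth: dict, pred: dict) -> dict:
    # Duplicate-class precision/recall/F1 over labelled pairs.
    tp = sum(1 for k, v in pred.items() if v and truth[k])
    fp = sum(1 for k, v in pred.items() if v and not truth[k])
    fn = sum(1 for k, v in pred.items() if not v and truth[k])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: the predicted clustering merges a/b correctly but also
# pulls in c, so precision drops while recall stays perfect.
truth = {frozenset(p): t for p, t in
         [(("a", "b"), True), (("a", "c"), False), (("b", "c"), False)]}
pred = cluster_to_pairs({"a": 0, "b": 0, "c": 0})
metrics = pairwise_prf(truth, pred)
```

For the NEWS-COPY side, once cluster assignments are converted to label arrays, `sklearn.metrics.adjusted_rand_score` computes the ARI directly.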
Step 5: Results Display
Format the evaluation results and display them as a Rich-rendered table on the console. Results include all computed metrics and the algorithm execution time. Separate tables are displayed for the CORE and NEWS-COPY benchmarks.
Key considerations:
- Uses the Rich library for formatted console output
- Results can be compared against published baselines in the README
- Timing information helps assess computational efficiency tradeoffs
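The project renders its tables with Rich; as a dependency-free sketch of the same idea, the result rows can be aligned with format strings. The column set and the numbers below are illustrative, not published results:

```python
def format_results(title: str, rows: list) -> str:
    # Plain-text stand-in for the Rich table used by the project.
    header = f"{'Algorithm':<10} {'Precision':>9} {'Recall':>9} {'F1':>9} {'Time (s)':>9}"
    lines = [title, header, "-" * len(header)]
    for name, p, r, f1, t in rows:
        lines.append(f"{name:<10} {p:>9.4f} {r:>9.4f} {f1:>9.4f} {t:>9.2f}")
    return "\n".join(lines)

# Illustrative values only; real runs print the metrics computed in Step 4.
table = format_results("CORE benchmark", [("MinHash", 0.9000, 0.8500, 0.8744, 12.30)])
```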