Workflow: ChenghaoMou text-dedup Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Benchmarking |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for evaluating deduplication algorithm quality on standard benchmark datasets (CORE and NEWS-COPY) with precision, recall, F1, and Adjusted Rand Index metrics.
Description
This workflow evaluates the MinHash and SimHash deduplication algorithms against two established benchmark datasets. The CORE benchmark (pinecone/core-2020-05-10-deduplication) evaluates pairwise duplicate detection on academic papers using precision, recall, macro F1, and accuracy. The NEWS-COPY benchmark (chenghao/NEWS-COPY-eval) evaluates clustering quality on news articles using Adjusted Rand Index (ARI). The benchmark runner loads datasets from HuggingFace Hub, prepares ground truth labels, runs the deduplication algorithms with benchmark-specific TOML configurations, loads the resulting clusters, and computes evaluation metrics.
Goal: Evaluation metrics (precision, recall, F1, ARI) and timing results for deduplication algorithms on standardized datasets.
Scope: From benchmark dataset loading through algorithm execution, cluster extraction, metric computation, and formatted result display.
Strategy: Uses a unified CLI runner that dispatches to dataset-specific benchmark functions, each of which runs the actual deduplication algorithm and evaluates the resulting clusters against ground truth.
Usage
Execute this workflow when you want to evaluate deduplication algorithm performance, compare algorithm configurations, or reproduce the published benchmark results. Use it after tuning algorithm hyperparameters (e.g., threshold, num_perm, bit_diff) to measure the impact on quality metrics. The benchmarks can be run individually (CORE-only, NEWS-only) or together, and support selecting specific algorithms (MinHash, SimHash, or both).
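The CLI surface described above can be sketched with argparse. The flag names and defaults here are assumptions for illustration; the actual runner may use a different CLI library and different option spellings.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flag names; the real runner may spell these differently.
    parser = argparse.ArgumentParser(description="Run dedup benchmark evaluation")
    parser.add_argument("--dataset", choices=["core", "news", "all"], default="all",
                        help="Benchmark dataset(s) to evaluate")
    parser.add_argument("--algorithm", choices=["minhash", "simhash", "all"], default="all",
                        help="Deduplication algorithm(s) to run")
    return parser

# Expand "all" into the concrete list before dispatching to the
# dataset-specific benchmark functions.
args = build_parser().parse_args(["--dataset", "core", "--algorithm", "minhash"])
datasets = ["core", "news"] if args.dataset == "all" else [args.dataset]
algorithms = ["minhash", "simhash"] if args.algorithm == "all" else [args.algorithm]
```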
Execution Steps
Step 1: Benchmark Configuration
Select the benchmark dataset(s) and algorithm(s) to evaluate via CLI arguments. Load the appropriate TOML configuration files from the configs/ directory. Each benchmark has dedicated config files (e.g., benchmark_core_minhash.toml, benchmark_news_simhash.toml) with parameters tuned for the specific dataset. The config is parsed into a Config object using the same config system as the main deduplication pipelines.
Key considerations:
- CLI accepts dataset (core, news, all) and algorithm (minhash, simhash, all) options
- Benchmark configs are separate from the main deduplication configs
- Each config includes benchmark-specific output directories and parameter tuning
Step 2: Dataset Loading and Ground Truth Preparation
Load the benchmark dataset from HuggingFace Hub and prepare ground truth labels. For CORE, extract the labelled_duplicates field and build a mapping from internal IDs to core_ids and a dictionary of known duplicate pairs. For NEWS-COPY, extract cluster labels and apply text preprocessing (news-specific normalization). The preprocessed dataset is temporarily saved to disk for the deduplication algorithm to load.
Key considerations:
- CORE uses pairwise duplicate labels; NEWS-COPY uses cluster assignments
- Text preprocessing is applied (e.g., lowercasing and concatenating title + abstract for CORE, news-specific cleanup for NEWS-COPY)
- Progress bars are temporarily disabled during dataset loading
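For the CORE side, the ground-truth preparation amounts to turning per-record duplicate lists into a labelled set of unordered pairs. A minimal sketch on toy data, assuming the `labelled_duplicates` field holds the core_ids of each record's known duplicates and that the annotations are symmetric:

```python
from itertools import combinations

# Toy rows mimicking the CORE schema.
rows = [
    {"core_id": "a", "labelled_duplicates": ["b"]},
    {"core_id": "b", "labelled_duplicates": ["a"]},
    {"core_id": "c", "labelled_duplicates": []},
]

# Map each core_id to its known duplicate set, then enumerate every
# unordered pair with a binary duplicate/non-duplicate label.
dup_map = {r["core_id"]: set(r["labelled_duplicates"]) for r in rows}
labels = {
    frozenset(pair): pair[1] in dup_map[pair[0]]
    for pair in combinations(sorted(dup_map), 2)
}
```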
Step 3: Algorithm Execution
Run the selected deduplication algorithm(s) using the benchmark configuration. The algorithm's main() function is called directly (minhash_main or simhash_main), executing the full deduplication pipeline. The preprocessed benchmark dataset is saved to a temporary location and the config is adjusted to load from it. Cluster results are saved to the configured output directory.
Key considerations:
- The actual deduplication algorithms are invoked programmatically, not via CLI
- Temporary dataset storage handles the preprocessed benchmark data
- Execution time is measured using the Timer utility
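The programmatic invocation pattern can be sketched as below. The `timer` context manager stands in for the project's Timer utility (interface assumed), and `minhash_main` is a placeholder for the real entry point:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(times: dict, name: str):
    # Stand-in for the project's Timer utility (assumed interface):
    # records elapsed wall-clock seconds under the given name.
    start = time.perf_counter()
    yield
    times[name] = time.perf_counter() - start

def minhash_main(config: dict) -> dict:
    # Placeholder for the real MinHash entry point, which would run the
    # full deduplication pipeline and write clusters to the output dir.
    return {"clusters_written_to": config["output_dir"]}

times: dict = {}
config = {"output_dir": "output/benchmark_core_minhash"}
with timer(times, "MinHash"):
    result = minhash_main(config)
```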
Step 4: Cluster Extraction and Metric Computation
Load the saved cluster assignments from the output directory (pickle format). For CORE, convert internal cluster IDs to core_ids and compute pairwise predictions, then evaluate precision, recall, macro F1 score, and accuracy for both duplicate and non-duplicate classes. For NEWS-COPY, convert cluster assignments to label arrays and compute the Adjusted Rand Index (ARI) against ground truth clusters.
Key considerations:
- Cluster-to-prediction conversion differs between CORE (pairwise) and NEWS-COPY (clustering)
- CORE metrics include per-class precision/recall and macro averages
- ARI measures clustering agreement independent of label permutation
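The cluster-to-pairwise conversion and metric computation for the CORE side can be sketched in pure Python. For brevity this computes precision/recall/F1 for the duplicate class only, whereas the benchmark also reports the non-duplicate class and macro averages; the toy data is illustrative:

```python
from itertools import combinations

def cluster_to_pairs(assignment: dict) -> dict:
    """Convert id -> cluster_id into unordered-pair -> bool predictions."""
    ids = sorted(assignment)
    return {frozenset(p): assignment[p[0]] == assignment[p[1]]
            for p in combinations(ids, 2)}

def pairwise_prf(truth: dict, pred: dict) -> dict:
    # Duplicate-class precision/recall/F1 over labelled pairs.
    tp = sum(1 for k, v in pred.items() if v and truth[k])
    fp = sum(1 for k, v in pred.items() if v and not truth[k])
    fn = sum(1 for k, v in pred.items() if not v and truth[k])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: the predicted clustering merges a/b correctly but also
# pulls in c, so precision drops while recall stays perfect.
truth = {frozenset(p): t for p, t in
         [(("a", "b"), True), (("a", "c"), False), (("b", "c"), False)]}
pred = cluster_to_pairs({"a": 0, "b": 0, "c": 0})
metrics = pairwise_prf(truth, pred)
```

For the NEWS-COPY side, once cluster assignments are converted to label arrays, `sklearn.metrics.adjusted_rand_score` computes the ARI directly.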
Step 5: Results Display
Format the evaluation results and display them as a Rich-rendered table on the console. Results include all computed metrics and the algorithm execution time. Separate tables are displayed for the CORE and NEWS-COPY benchmarks.
Key considerations:
- Uses the Rich library for formatted console output
- Results can be compared against published baselines in the README
- Timing information helps assess computational efficiency tradeoffs
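The project renders its tables with Rich; as a dependency-free sketch of the same idea, the result rows can be aligned with format strings. The column set and the numbers below are illustrative, not published results:

```python
def format_results(title: str, rows: list) -> str:
    # Plain-text stand-in for the Rich table used by the project.
    header = f"{'Algorithm':<10} {'Precision':>9} {'Recall':>9} {'F1':>9} {'Time (s)':>9}"
    lines = [title, header, "-" * len(header)]
    for name, p, r, f1, t in rows:
        lines.append(f"{name:<10} {p:>9.4f} {r:>9.4f} {f1:>9.4f} {t:>9.2f}")
    return "\n".join(lines)

# Illustrative values only; real runs print the metrics computed in Step 4.
table = format_results("CORE benchmark", [("MinHash", 0.9000, 0.8500, 0.8744, 12.30)])
```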