Heuristic:ThreeSR Awesome Inference Time Scaling Duplicate Detection By Title

From Leeroopedia
Knowledge Sources
Domains Data_Deduplication, Document_Processing
Last Updated 2026-02-14 00:00 GMT

Overview

Deduplication strategy that uses exact title matching via regex extraction to prevent duplicate paper entries in the curated README list.

Description

The write_to_readme_in_sorted_order() function runs a title-based deduplication check before inserting new papers. It builds a set of existing titles by extracting them from the current README entries with the regex pattern 🔹 \[(.*?)\], then filters incoming papers by checking whether each paper's title appears in this set. Matching is exact and case-sensitive; only leading and trailing whitespace is stripped before comparison.

This approach is simple and effective for the repository's use case, but has known limitations: papers with slightly different title formatting (e.g., different capitalization, trailing punctuation, or subtitle variations) will not be detected as duplicates.

Usage

Use this heuristic when:

  • Adding papers programmatically via write_to_readme_in_sorted_order() and you need to understand why some papers are skipped.
  • Debugging "Paper Already Existed!" messages to verify that the deduplication is working correctly.
  • Considering edge cases where the same paper might come back with slightly different title strings from different API responses.

The Insight (Rule of Thumb)

  • Action: The deduplication relies on exact title string matching extracted via the regex 🔹 \[(.*?)\] from existing README entries, compared against the title field from the Semantic Scholar API response.
  • Value: Prevents the most common case of duplicate insertion (running the script multiple times with the same query).
  • Trade-off: Misses near-duplicate titles (different capitalization, punctuation, or subtitle formatting). In particular, if the API returns a title string that differs even slightly from the one already in the README, the check produces a false negative and the paper is inserted again.

Reasoning

Exact title matching was chosen for simplicity and reliability. The regex 🔹 \[(.*?)\] extracts the link text that follows the 🔹 marker in each markdown entry, which is the paper title as originally formatted. This is compared against the title field from the Semantic Scholar API response, with leading and trailing whitespace stripped on both sides.
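To make the extraction step concrete, the following minimal sketch applies the same pattern to a few sample lines. The entry layout (🔹 [Title](url)) and the example.org URLs are assumptions for illustration; they are not taken from the repository's README.

```python
import re

# Same pattern as quoted above: non-greedy capture of the text inside [ ... ].
TITLE_PATTERN = re.compile(r"🔹 \[(.*?)\]")

# Assumed entry layout: marker, then a markdown link whose text is the title.
entry = "🔹 [Chain-of-Thought Prompting](https://example.org/paper)"
captured = TITLE_PATTERN.search(entry).group(1)
print(captured)  # Chain-of-Thought Prompting

# A line without the marker yields no match, so it contributes no title.
no_marker = TITLE_PATTERN.search("## Some section heading")
print(no_marker)  # None

# The non-greedy group stops at the first ']' it meets.
bracketed = TITLE_PATTERN.search("🔹 [Foo [Bar] Baz](https://example.org/x)").group(1)
print(bracketed)  # Foo [Bar
```

Under this assumed layout, a title that itself contains a closing bracket would be captured only up to that bracket, so it could never equal the full title string returned by the API.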

The code evidence shows two key components:

Title extraction from existing entries (fetch_semantic_info.py:147-151):

existing_titles = set()
for entry in existing_entries:
    m = re.search(r'🔹 \[(.*?)\]', entry)
    if m:
        existing_titles.add(m.group(1).strip())

Filtering new papers against existing titles (fetch_semantic_info.py:154-160):

filtered_new_papers = []
for paper in new_papers:
    title = paper.get("title", "N/A").strip()
    if title in existing_titles:
        print("Paper Already Existed!")
    else:
        filtered_new_papers.append(paper)
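The two excerpts can be combined into a small self-contained sketch to see the filter's behavior end to end. The sample entries, the URLs, and the helper names extract_titles() and filter_new() are hypothetical; only the regex and the comparison logic mirror the excerpts above.

```python
import re

TITLE_PATTERN = re.compile(r"🔹 \[(.*?)\]")

def extract_titles(entries):
    """Collect stripped titles from README-style entry lines (hypothetical helper)."""
    titles = set()
    for entry in entries:
        m = TITLE_PATTERN.search(entry)
        if m:
            titles.add(m.group(1).strip())
    return titles

def filter_new(papers, existing_titles):
    """Keep only papers whose stripped title is not already present (hypothetical helper)."""
    return [p for p in papers if p.get("title", "N/A").strip() not in existing_titles]

# Assumed entry layout and made-up data, for illustration only.
existing_entries = [
    "🔹 [Chain-of-Thought Prompting](https://example.org/a)",
    "🔹 [Scaling Test-Time Compute](https://example.org/b)",
]
new_papers = [
    {"title": "Chain-of-Thought Prompting"},     # exact duplicate: skipped
    {"title": "  Scaling Test-Time Compute  "},  # duplicate after strip(): skipped
    {"title": "Chain-of-thought prompting"},     # differs only in case: kept
]

existing_titles = extract_titles(existing_entries)
kept = filter_new(new_papers, existing_titles)
print([p["title"] for p in kept])  # ['Chain-of-thought prompting']
```

The third paper passes the filter even though it is the same work as the first entry, which is the case-sensitivity limitation described in this page.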

Known edge cases:

  • A paper titled "Chain-of-Thought Prompting" vs "Chain-of-thought prompting" would not be detected as a duplicate (case-sensitive matching).
  • A paper whose title was manually edited in the README would not match the API response.
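One possible mitigation for the edge cases above is to compare normalized titles instead of raw strings. This is not something the repository is described as doing; normalize_title() below is a hypothetical helper, and the normalization rules (case folding, whitespace collapsing, trailing punctuation removal) are assumptions chosen for illustration.

```python
import re
import string

def normalize_title(title: str) -> str:
    """Hypothetical helper: collapse whitespace, drop trailing punctuation, fold case."""
    collapsed = re.sub(r"\s+", " ", title).strip()
    return collapsed.rstrip(string.punctuation + " ").casefold()

a = "Chain-of-Thought Prompting"
b = "chain-of-thought  prompting."

exact_match = a.strip() == b.strip()                      # current behavior
normalized_match = normalize_title(a) == normalize_title(b)

print(exact_match)       # False
print(normalized_match)  # True
```

Looser matching of this kind trades missed duplicates for the opposite risk: two distinct papers whose titles differ only in case or trailing punctuation would be treated as the same entry.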
