# Heuristic: Paper Deduplication via Dict (mbzuai-oryx/Awesome-LLM-Post-training)
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Optimization |
| Last Updated | 2026-02-08 08:00 GMT |
## Overview
Deduplication strategy using a global dictionary keyed by paper ID to skip already-processed papers during recursive citation graph traversal.
## Description
During deep paper collection, the recursive traversal of references and citations creates a citation graph where the same paper can be reached via multiple paths. Without deduplication, the same paper would be fetched multiple times, wasting API calls and inflating the dataset. The script uses a global processed_papers dictionary keyed by Semantic Scholar paper IDs. Before fetching any paper, the function checks if the paper ID already exists in the dictionary and returns the cached result immediately if so.
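The check-before-fetch pattern described above can be sketched as follows. This is a minimal illustration, not the actual script: the function name, the `fetch_fn` callback, and the `references` field are assumptions for the sketch; only the `processed_papers` dict and the early-return check mirror the source.

```python
# Global cache keyed by paper ID; checked before every expensive fetch.
processed_papers = {}

def fetch_paper(paper_id, fetch_fn, depth=0, max_depth=2):
    """Return cached metadata if already seen; otherwise fetch once and recurse."""
    if paper_id in processed_papers:
        return processed_papers[paper_id]  # skip duplicates, zero API cost
    paper = fetch_fn(paper_id)  # the expensive operation (e.g. an API call)
    processed_papers[paper_id] = paper
    if depth < max_depth:
        # Recurse into references; shared references hit the cache above.
        for ref_id in paper.get("references", []):
            fetch_paper(ref_id, fetch_fn, depth + 1, max_depth)
    return paper
```

Because the cache is consulted before `fetch_fn`, a paper reachable via several citation paths triggers exactly one fetch.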
## Usage
Apply this heuristic when traversing any graph structure (citation networks, dependency trees, linked data) where nodes can be reached via multiple paths. The dictionary lookup provides O(1) deduplication with zero API cost for already-seen papers.
## The Insight (Rule of Thumb)
- Action: Maintain a global dictionary keyed by unique identifier. Check for existence before making expensive operations (API calls, database queries).
- Value: O(1) lookup per item. In citation graph traversal, deduplication can eliminate 50-80% of redundant API calls since highly-cited papers appear in many reference lists.
- Trade-off: The dictionary grows without bound, since it stores full paper metadata for every processed paper. At the configured limit of 1,000 papers this is manageable (roughly 50-100 MB).
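When the cache is needed only for deduplication (i.e. it is not also serving as the output dataset, as it does in this script), the memory trade-off can be softened by tracking bare IDs in a set instead of full metadata. A hypothetical sketch; the names here are illustrative:

```python
# Lighter-weight variant: a set of seen IDs gives the same O(1) dedup
# check while storing no metadata at all.
seen_ids = set()

def should_process(paper_id):
    """Return True the first time an ID is seen, False on every repeat."""
    if paper_id in seen_ids:
        return False
    seen_ids.add(paper_id)
    return True
```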
## Reasoning
In academic citation graphs, highly-cited papers appear in the reference lists of many other papers. Without deduplication, a paper cited by 100 other papers would trigger 100 separate API calls, each returning identical data. The dictionary-based approach:
- Eliminates redundant API calls: Each paper is fetched exactly once regardless of how many times it appears in reference/citation lists.
- Serves as a cache: The stored metadata is returned immediately for duplicate requests.
- Doubles as the output dataset: The `processed_papers` dictionary is the primary data structure from which the final JSON/Excel exports are derived.
Code evidence from `scripts/deep_collection_sementic.py:13`:

```python
# Dictionary to track processed papers
processed_papers = {}
```
Code evidence from `scripts/deep_collection_sementic.py:43-44`:

```python
if paper_id in processed_papers:
    return processed_papers[paper_id]  # Skip duplicates
```
Code evidence from `scripts/deep_collection_sementic.py:90`:

```python
# Track processed papers
processed_papers[paper_id] = {
    "Title": title,
    "Authors": authors,
    "Abstract": abstract,
    ...
}
```
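The third role of the dictionary (doubling as the output dataset) means no separate result list is needed at export time: the cache is serialized directly. A minimal sketch, assuming a JSON export; the file name and sample record are illustrative:

```python
import json

# The dedup cache at the end of collection IS the dataset.
processed_papers = {
    "abc123": {"Title": "Example Paper", "Authors": "A. Author", "Abstract": "..."},
}

# Serialize the dict as-is; keys are paper IDs, values are metadata records.
with open("papers.json", "w", encoding="utf-8") as f:
    json.dump(processed_papers, f, ensure_ascii=False, indent=2)
```

An Excel export can be derived from the same dict in one step (e.g. via `pandas.DataFrame.from_dict(processed_papers, orient="index")`).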