Implementation:Mbzuai oryx Awesome LLM Post training Fetch Paper Details
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Graph_Traversal, Bibliometrics |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete tool for recursively fetching paper metadata and expanding the citation graph via the Semantic Scholar API.
Description
The fetch_paper_details function retrieves complete metadata for a single paper from the Semantic Scholar /paper/{id} endpoint, then recursively follows its references and citations up to depth 2. It uses a global processed_papers dictionary for deduplication and a global paper_count counter to enforce the collection cap. Rate-limit responses (HTTP 429) trigger automatic retry with configurable backoff.
The function builds a nested data structure where each paper's References and Citations fields contain fully-fetched detail dictionaries of linked papers, enabling rich citation graph analysis.
Usage
Call this function for each seed paper returned by search_papers. It will automatically expand the corpus by recursively crawling references and citations. Ensure global configuration variables (max_papers, max_ref_citations, rate_limit_wait, processed_papers, paper_count) are initialized before calling.
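A typical initialization of those globals before the first call might look like the following; the concrete values are illustrative examples, not the repository's defaults.

```python
# Illustrative initialization of the globals fetch_paper_details relies on;
# the specific values here are examples, not the repository's defaults.
max_papers = 500          # cap on total papers collected
max_ref_citations = 200   # max references/citations followed per paper
rate_limit_wait = 10      # seconds to sleep on HTTP 429
processed_papers = {}     # paper_id -> entry, deduplication map
paper_count = 0           # running total, incremented per fetched paper
```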
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 37-103
Signature
def fetch_paper_details(paper_id: str, depth: int = 1) -> Optional[dict]:
    """
    Fetches paper details with a limit on recursion depth.

    Args:
        paper_id: Semantic Scholar paper ID.
        depth: Current recursion depth (stops at depth > 2).

    Returns:
        Dict with keys: Title, Authors, Abstract, TL;DR,
        Publication Year, Venue (Conference/Journal), Link,
        References (list of recursively fetched dicts),
        Citations (list of recursively fetched dicts).
        Returns None if paper_count >= max_papers, duplicate, or API error.

    Side Effects:
        Increments global paper_count.
        Adds entry to global processed_papers dict.
    """
Import
# Function defined in scripts/deep_collection_sementic.py
# Dependencies:
import requests
import time
from tqdm import tqdm
from typing import Optional  # for the Optional[dict] return annotation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| paper_id | str | Yes | Semantic Scholar paper ID to fetch |
| depth | int | No | Current recursion depth, default 1. Recursion stops when depth > 2 |
Global State Read:
| Name | Type | Description |
|---|---|---|
| processed_papers | dict | Deduplication map of already-fetched papers |
| paper_count | int | Running count of collected papers |
| max_papers | int | Cap on total papers |
| max_ref_citations | int | Max references/citations to follow per paper |
| rate_limit_wait | int | Seconds to sleep on HTTP 429 |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | Optional[dict] | Paper metadata dict with nested References and Citations lists, or None |
Output Dict Structure:
| Key | Type | Description |
|---|---|---|
| Title | str | Paper title |
| Authors | str | Comma-separated author names |
| Abstract | str | Paper abstract |
| TL;DR | str | Auto-generated summary from Semantic Scholar |
| Publication Year | int or str | Year of publication, or "Unknown Year" |
| Venue (Conference/Journal) | str | Publication venue |
| Link | str | URL to the paper |
| References | list[dict] | Recursively fetched reference paper details |
| Citations | list[dict] | Recursively fetched citing paper details |
Usage Examples
Basic Single Paper Fetch
# Fetch details for a known paper ID
paper_id = "649def34f8be52c8b66281af98ae884c09aef38b"
details = fetch_paper_details(paper_id, depth=1)
if details:
    print(f"Title: {details['Title']}")
    print(f"Year: {details['Publication Year']}")
    print(f"References: {len(details['References'])}")
    print(f"Citations: {len(details['Citations'])}")
Integration with Seed Search
# Full pipeline: seed search → recursive fetch
query = "Survey on Large Language and Reinforcement Learning"
papers = search_papers(query, limit=1)
data = []
if papers:
    for paper in papers:
        paper_id = paper.get("paperId")
        if paper_id:
            details = fetch_paper_details(paper_id)
            if details:
                data.append(details)
print(f"Total papers collected: {paper_count}")
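The nested dicts collected above can be flattened into citation edges for graph analysis. The helper below (`extract_edges`) is hypothetical and not part of the repository; it only assumes the References/Citations structure documented in the I/O contract.

```python
# Hypothetical helper (not in the repository): walk the nested
# References/Citations structure and emit (source, target) citation edges.
def extract_edges(paper, edges=None):
    if edges is None:
        edges = []
    # This paper references each entry in its References list.
    for ref in paper.get("References", []):
        edges.append((paper["Title"], ref["Title"], "references"))
        extract_edges(ref, edges)
    # Each entry in Citations references this paper, so reverse the direction.
    for cit in paper.get("Citations", []):
        edges.append((cit["Title"], paper["Title"], "references"))
        extract_edges(cit, edges)
    return edges
```

Citing papers are recorded with the direction reversed, so every edge uniformly reads "source references target".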
Related Pages
Implements Principle
Requires Environment
Uses Heuristic
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_API_Rate_Limit_Retry_Strategy
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_Paper_Deduplication_Via_Dict
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_Depth_Limit_Recursion_At_2
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_Reference_Citation_Cap_200