Implementation:Mbzuai oryx Awesome LLM Post training Fetch Paper Details
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Graph_Traversal, Bibliometrics |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete tool for recursively fetching paper metadata and expanding the citation graph via the Semantic Scholar API.
Description
The fetch_paper_details function retrieves complete metadata for a single paper from the Semantic Scholar /paper/{id} endpoint, then recursively follows its references and citations up to depth 2. It uses a global processed_papers dictionary for deduplication and a global paper_count counter to enforce the collection cap. Rate-limit responses (HTTP 429) trigger automatic retry with configurable backoff.
The function builds a nested data structure where each paper's References and Citations fields contain fully-fetched detail dictionaries of linked papers, enabling rich citation graph analysis.
Usage
Call this function for each seed paper returned by search_papers. It will automatically expand the corpus by recursively crawling references and citations. Ensure global configuration variables (max_papers, max_ref_citations, rate_limit_wait, processed_papers, paper_count) are initialized before calling.
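A typical initialization of those globals before the first call might look like the following; the concrete values are illustrative examples, not the repository's defaults.

```python
# Illustrative initialization of the globals fetch_paper_details relies on;
# the specific values here are examples, not the repository's defaults.
max_papers = 500          # cap on total papers collected
max_ref_citations = 200   # max references/citations followed per paper
rate_limit_wait = 10      # seconds to sleep on HTTP 429
processed_papers = {}     # paper_id -> entry, deduplication map
paper_count = 0           # running total, incremented per fetched paper
```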
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 37-103
Signature
def fetch_paper_details(paper_id: str, depth: int = 1) -> Optional[dict]:
    """
    Fetches paper details with a limit on recursion depth.

    Args:
        paper_id: Semantic Scholar paper ID.
        depth: Current recursion depth (stops at depth > 2).

    Returns:
        Dict with keys: Title, Authors, Abstract, TL;DR,
        Publication Year, Venue (Conference/Journal), Link,
        References (list of recursively fetched dicts),
        Citations (list of recursively fetched dicts).
        Returns None if paper_count >= max_papers, duplicate, or API error.

    Side Effects:
        Increments global paper_count.
        Adds entry to global processed_papers dict.
    """
Import
# Function defined in scripts/deep_collection_sementic.py
# Dependencies:
import requests
import time
from tqdm import tqdm
from typing import Optional  # for the Optional[dict] return annotation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| paper_id | str | Yes | Semantic Scholar paper ID to fetch |
| depth | int | No | Current recursion depth, default 1. Recursion stops when depth > 2 |
Global State Read:
| Name | Type | Description |
|---|---|---|
| processed_papers | dict | Deduplication map of already-fetched papers |
| paper_count | int | Running count of collected papers |
| max_papers | int | Cap on total papers |
| max_ref_citations | int | Max references/citations to follow per paper |
| rate_limit_wait | int | Seconds to sleep on HTTP 429 |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | Optional[dict] | Paper metadata dict with nested References and Citations lists, or None |
Output Dict Structure:
| Key | Type | Description |
|---|---|---|
| Title | str | Paper title |
| Authors | str | Comma-separated author names |
| Abstract | str | Paper abstract |
| TL;DR | str | Auto-generated summary from Semantic Scholar |
| Publication Year | int or str | Year of publication, or "Unknown Year" |
| Venue (Conference/Journal) | str | Publication venue |
| Link | str | URL to the paper |
| References | list[dict] | Recursively fetched reference paper details |
| Citations | list[dict] | Recursively fetched citing paper details |
Usage Examples
Basic Single Paper Fetch
# Fetch details for a known paper ID
paper_id = "649def34f8be52c8b66281af98ae884c09aef38b"
details = fetch_paper_details(paper_id, depth=1)
if details:
    print(f"Title: {details['Title']}")
    print(f"Year: {details['Publication Year']}")
    print(f"References: {len(details['References'])}")
    print(f"Citations: {len(details['Citations'])}")
Integration with Seed Search
# Full pipeline: seed search → recursive fetch
query = "Survey on Large Language and Reinforcement Learning"
papers = search_papers(query, limit=1)
data = []
if papers:
    for paper in papers:
        paper_id = paper.get("paperId")
        if paper_id:
            details = fetch_paper_details(paper_id)
            if details:
                data.append(details)
print(f"Total papers collected: {paper_count}")
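The nested dicts collected above can be flattened into citation edges for graph analysis. The helper below (`extract_edges`) is hypothetical and not part of the repository; it only assumes the References/Citations structure documented in the I/O contract.

```python
# Hypothetical helper (not in the repository): walk the nested
# References/Citations structure and emit (source, target) citation edges.
def extract_edges(paper, edges=None):
    if edges is None:
        edges = []
    # This paper references each entry in its References list.
    for ref in paper.get("References", []):
        edges.append((paper["Title"], ref["Title"], "references"))
        extract_edges(ref, edges)
    # Each entry in Citations references this paper, so reverse the direction.
    for cit in paper.get("Citations", []):
        edges.append((cit["Title"], paper["Title"], "references"))
        extract_edges(cit, edges)
    return edges
```

Citing papers are recorded with the direction reversed, so every edge uniformly reads "source references target".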
Related Pages
Implements Principle
Requires Environment
Uses Heuristic
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_API_Rate_Limit_Retry_Strategy
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_Paper_Deduplication_Via_Dict
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_Depth_Limit_Recursion_At_2
- Heuristic:Mbzuai_oryx_Awesome_LLM_Post_training_Reference_Citation_Cap_200