Workflow: mbzuai-oryx/Awesome-LLM-Post-training Deep Paper Collection
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Academic_Research, Paper_Collection |
| Last Updated | 2025-02-28 14:00 GMT |
Overview
End-to-end process for automated academic paper discovery and corpus building using the Semantic Scholar API with recursive reference and citation crawling.
Description
This workflow automates the collection of academic papers related to LLM post-training research. Starting from a seed query, it searches the Semantic Scholar Graph API for initial papers, then recursively fetches their references and citations up to a configurable depth. The result is a comprehensive JSON dataset with full metadata (title, authors, abstract, TL;DR, year, venue, links) that forms the foundation for curated resource lists.
Goal: A structured JSON/Excel dataset of 1000+ papers with complete metadata and citation graphs.
Scope: From a seed search query to a deduplicated, metadata-rich paper corpus saved in both JSON and Excel formats.
Strategy: Uses breadth-first recursive crawling with deduplication tracking, rate-limit handling, and progressive checkpointing to build a robust paper collection without data loss.
Usage
Execute this workflow when you need to build or expand an academic paper corpus for a survey or awesome-list project. This is appropriate when you have a research topic query (e.g., "Large Language Models and Reinforcement Learning") and need to systematically discover the full landscape of related work, including papers reachable only through citation chains.
Execution Steps
Step 1: Configure Collection Parameters
Define the seed search query, collection limits, and API settings. This includes setting the maximum number of papers to collect, the maximum recursion depth for reference/citation traversal, rate-limit wait times, and the initial search query string.
Key considerations:
- Choose a query broad enough to capture the research area but specific enough to avoid noise
- Set paper limits based on available API quota and desired corpus size
- Configure rate-limit wait times to comply with Semantic Scholar API terms of service
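As a sketch, the parameters from this step might be grouped into a single configuration mapping. All names and values below are illustrative, not prescribed by the workflow:

```python
# Illustrative collection parameters; key names and values are examples only.
CONFIG = {
    "query": "Large Language Models and Reinforcement Learning",  # seed search query
    "max_papers": 1000,        # overall corpus size cap (matches the stated goal)
    "search_limit": 20,        # number of seed papers from the initial search
    "max_depth": 2,            # recursion depth for reference/citation traversal
    "max_links_per_paper": 5,  # cap on references/citations followed per paper
    "rate_limit_wait": 10,     # seconds to wait after an HTTP 429 response
}
```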
Step 2: Execute Seed Search
Submit the seed query to the Semantic Scholar Graph API search endpoint. The search returns an initial batch of papers matching the query, each with full metadata fields including title, authors, abstract, TL;DR, year, venue, references, and citations.
Key considerations:
- The initial search limit controls how many seed papers begin the crawl
- Retry logic handles HTTP 429 rate-limit responses automatically
- If no papers are found, the workflow terminates with an error message
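A minimal sketch of the seed search against the Graph API's `/paper/search` endpoint, with simple retry on HTTP 429. The helper names, retry parameters, and the exact field list are assumptions for illustration:

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

API_BASE = "https://api.semanticscholar.org/graph/v1"
FIELDS = "title,authors,abstract,tldr,year,venue,url"  # illustrative field subset

def build_search_url(query, limit=20):
    """Build the /paper/search request URL for a seed query."""
    return "%s/paper/search?query=%s&limit=%d&fields=%s" % (
        API_BASE, urllib.parse.quote(query), limit, FIELDS)

def get_json_with_retry(url, max_retries=5, wait_seconds=10):
    """GET a URL as JSON, sleeping and retrying on HTTP 429 rate limits."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            if err.code == 429 and attempt < max_retries - 1:
                time.sleep(wait_seconds)
                continue
            raise

def search_seed_papers(query, limit=20):
    """Run the seed search; terminate with an error if nothing matches."""
    data = get_json_with_retry(build_search_url(query, limit))
    papers = (data or {}).get("data", [])
    if not papers:
        raise SystemExit("No papers found for query: %r" % query)
    return papers
```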
Step 3: Fetch Paper Details Recursively
For each seed paper, fetch its full details from the Semantic Scholar API. Then recursively follow reference and citation links up to the configured depth (typically depth 2). Each fetched paper is deduplicated against a global tracking dictionary to avoid redundant API calls.
Key considerations:
- Deduplication via a processed-papers dictionary prevents re-fetching known papers
- Depth limiting prevents exponential API call growth
- Per-paper reference and citation counts are capped to control crawl breadth
- Rate-limit retries are applied at each individual fetch
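The recursion, deduplication, and depth/breadth capping above can be sketched with an injectable fetch function. This is a depth-first sketch for brevity; replacing the recursion with a queue would give the breadth-first traversal the Strategy describes. In practice `fetch` would wrap the rate-limited details call; all names here are illustrative:

```python
def crawl(paper_id, fetch, processed, depth=0, max_depth=2, max_links=5):
    """Collect a paper and its linked papers, deduplicating by paper ID.

    `fetch` maps a paper ID to its metadata dict (with optional
    'references' and 'citations' lists); `processed` is the global
    dedup dictionary shared across the whole crawl.
    """
    if paper_id in processed or depth > max_depth:
        return  # already fetched, or past the configured depth
    paper = fetch(paper_id)
    processed[paper_id] = paper
    # Follow at most `max_links` references and citations per paper.
    for link in (paper.get("references") or [])[:max_links]:
        if link.get("paperId"):
            crawl(link["paperId"], fetch, processed, depth + 1, max_depth, max_links)
    for link in (paper.get("citations") or [])[:max_links]:
        if link.get("paperId"):
            crawl(link["paperId"], fetch, processed, depth + 1, max_depth, max_links)
```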
Step 4: Progressive Checkpointing
During the crawl, intermediate results are saved to a temporary JSON file at regular intervals (every 3 papers). This prevents data loss if the process is interrupted due to API errors, rate limits, or other failures.
Key considerations:
- Checkpoint frequency balances I/O overhead against data-loss risk
- Temporary files use a distinct name to avoid overwriting final output
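A minimal checkpointing sketch under the assumptions above (the filename and interval are illustrative). The write-then-rename pattern ensures an interruption mid-write never leaves a truncated checkpoint:

```python
import json
import os

CHECKPOINT_EVERY = 3  # save after every 3 newly collected papers

def maybe_checkpoint(processed, path="papers_collection.tmp.json"):
    """Write intermediate results to a temporary file at the checkpoint interval."""
    if len(processed) == 0 or len(processed) % CHECKPOINT_EVERY != 0:
        return False
    part = path + ".part"
    with open(part, "w", encoding="utf-8") as f:
        json.dump(list(processed.values()), f, ensure_ascii=False, indent=2)
    os.replace(part, path)  # atomic rename: readers never see a partial file
    return True
```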
Step 5: Export Final Results
Once the crawl completes or the paper limit is reached, export the full collected dataset. The primary output is a JSON file with nested paper metadata, references, and citations. A secondary Excel export is generated by flattening the JSON structure for tabular analysis.
Key considerations:
- JSON preserves the full nested reference/citation structure
- Excel output normalizes nested fields, which may lose some hierarchical detail
- A summary report is printed with paper counts and file locations
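The Excel flattening can be sketched as a per-paper row builder; the field choices below are illustrative, and the resulting rows would typically be written out with something like `pandas.DataFrame(rows).to_excel(...)`:

```python
def flatten_paper(paper):
    """Flatten one nested paper record into a flat row for tabular export.

    Nested references/citations are reduced to counts here, which is
    the hierarchical detail the Excel output loses relative to the JSON.
    """
    return {
        "paperId": paper.get("paperId"),
        "title": paper.get("title"),
        "year": paper.get("year"),
        "venue": paper.get("venue"),
        "authors": "; ".join(a.get("name", "") for a in paper.get("authors") or []),
        "tldr": (paper.get("tldr") or {}).get("text"),
        "url": paper.get("url"),
        "num_references": len(paper.get("references") or []),
        "num_citations": len(paper.get("citations") or []),
    }
```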