Workflow: mbzuai-oryx/Awesome-LLM-Post-training Deep Paper Collection
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Academic_Research, Paper_Collection |
| Last Updated | 2025-02-28 14:00 GMT |
Overview
End-to-end process for automated academic paper discovery and corpus building using the Semantic Scholar API with recursive reference and citation crawling.
Description
This workflow automates the collection of academic papers related to LLM post-training research. Starting from a seed query, it searches the Semantic Scholar Graph API for initial papers, then recursively fetches their references and citations up to a configurable depth. The result is a comprehensive JSON dataset with full metadata (title, authors, abstract, TL;DR, year, venue, links) that forms the foundation for curated resource lists.
Goal: A structured JSON/Excel dataset of 1000+ papers with complete metadata and citation graphs.
Scope: From a seed search query to a deduplicated, metadata-rich paper corpus saved in both JSON and Excel formats.
Strategy: Uses breadth-first recursive crawling with deduplication tracking, rate-limit handling, and progressive checkpointing to build a robust paper collection without data loss.
Usage
Execute this workflow when you need to build or expand an academic paper corpus for a survey or awesome-list project. This is appropriate when you have a research topic query (e.g., "Large Language Models and Reinforcement Learning") and need to systematically discover the full landscape of related work, including papers reachable only through citation chains.
Execution Steps
Step 1: Configure Collection Parameters
Define the seed search query, collection limits, and API settings. This includes setting the maximum number of papers to collect, the maximum recursion depth for reference/citation traversal, rate-limit wait times, and the initial search query string.
Key considerations:
- Choose a query broad enough to capture the research area but specific enough to avoid noise
- Set paper limits based on available API quota and desired corpus size
- Configure rate-limit wait times to comply with Semantic Scholar API terms of service
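As a sketch, the parameters from this step might be grouped into a single configuration mapping. All names and values below are illustrative, not prescribed by the workflow:

```python
# Illustrative collection parameters; key names and values are examples only.
CONFIG = {
    "query": "Large Language Models and Reinforcement Learning",  # seed search query
    "max_papers": 1000,        # overall corpus size cap (matches the stated goal)
    "search_limit": 20,        # number of seed papers from the initial search
    "max_depth": 2,            # recursion depth for reference/citation traversal
    "max_links_per_paper": 5,  # cap on references/citations followed per paper
    "rate_limit_wait": 10,     # seconds to wait after an HTTP 429 response
}
```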
Step 2: Execute Seed Search
Submit the seed query to the Semantic Scholar Graph API search endpoint. The search returns an initial batch of papers matching the query, each with full metadata fields including title, authors, abstract, TL;DR, year, venue, references, and citations.
Key considerations:
- The initial search limit controls how many seed papers begin the crawl
- Retry logic handles HTTP 429 rate-limit responses automatically
- If no papers are found, the workflow terminates with an error message
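A minimal sketch of the seed search against the Graph API's `/paper/search` endpoint, with simple retry on HTTP 429. The helper names, retry parameters, and the exact field list are assumptions for illustration:

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

API_BASE = "https://api.semanticscholar.org/graph/v1"
FIELDS = "title,authors,abstract,tldr,year,venue,url"  # illustrative field subset

def build_search_url(query, limit=20):
    """Build the /paper/search request URL for a seed query."""
    return "%s/paper/search?query=%s&limit=%d&fields=%s" % (
        API_BASE, urllib.parse.quote(query), limit, FIELDS)

def get_json_with_retry(url, max_retries=5, wait_seconds=10):
    """GET a URL as JSON, sleeping and retrying on HTTP 429 rate limits."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            if err.code == 429 and attempt < max_retries - 1:
                time.sleep(wait_seconds)
                continue
            raise

def search_seed_papers(query, limit=20):
    """Run the seed search; terminate with an error if nothing matches."""
    data = get_json_with_retry(build_search_url(query, limit))
    papers = (data or {}).get("data", [])
    if not papers:
        raise SystemExit("No papers found for query: %r" % query)
    return papers
```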
Step 3: Fetch Paper Details Recursively
For each seed paper, fetch its full details from the Semantic Scholar API. Then recursively follow reference and citation links up to the configured depth (typically depth 2). Each fetched paper is deduplicated against a global tracking dictionary to avoid redundant API calls.
Key considerations:
- Deduplication via a processed-papers dictionary prevents re-fetching known papers
- Depth limiting prevents exponential API call growth
- Per-paper reference and citation counts are capped to control crawl breadth
- Rate-limit retries are applied at each individual fetch
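The recursion, deduplication, and depth/breadth capping above can be sketched with an injectable fetch function. This is a depth-first sketch for brevity; replacing the recursion with a queue would give the breadth-first traversal the Strategy describes. In practice `fetch` would wrap the rate-limited details call; all names here are illustrative:

```python
def crawl(paper_id, fetch, processed, depth=0, max_depth=2, max_links=5):
    """Collect a paper and its linked papers, deduplicating by paper ID.

    `fetch` maps a paper ID to its metadata dict (with optional
    'references' and 'citations' lists); `processed` is the global
    dedup dictionary shared across the whole crawl.
    """
    if paper_id in processed or depth > max_depth:
        return  # already fetched, or past the configured depth
    paper = fetch(paper_id)
    processed[paper_id] = paper
    # Follow at most `max_links` references and citations per paper.
    for link in (paper.get("references") or [])[:max_links]:
        if link.get("paperId"):
            crawl(link["paperId"], fetch, processed, depth + 1, max_depth, max_links)
    for link in (paper.get("citations") or [])[:max_links]:
        if link.get("paperId"):
            crawl(link["paperId"], fetch, processed, depth + 1, max_depth, max_links)
```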
Step 4: Progressive Checkpointing
During the crawl, intermediate results are saved to a temporary JSON file at regular intervals (every 3 papers). This prevents data loss if the process is interrupted due to API errors, rate limits, or other failures.
Key considerations:
- Checkpoint frequency balances I/O overhead against data-loss risk
- Temporary files use a distinct name to avoid overwriting final output
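A minimal checkpointing sketch under the assumptions above (the filename and interval are illustrative). The write-then-rename pattern ensures an interruption mid-write never leaves a truncated checkpoint:

```python
import json
import os

CHECKPOINT_EVERY = 3  # save after every 3 newly collected papers

def maybe_checkpoint(processed, path="papers_collection.tmp.json"):
    """Write intermediate results to a temporary file at the checkpoint interval."""
    if len(processed) == 0 or len(processed) % CHECKPOINT_EVERY != 0:
        return False
    part = path + ".part"
    with open(part, "w", encoding="utf-8") as f:
        json.dump(list(processed.values()), f, ensure_ascii=False, indent=2)
    os.replace(part, path)  # atomic rename: readers never see a partial file
    return True
```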
Step 5: Export Final Results
Once the crawl completes or the paper limit is reached, export the full collected dataset. The primary output is a JSON file with nested paper metadata, references, and citations. A secondary Excel export is generated by flattening the JSON structure for tabular analysis.
Key considerations:
- JSON preserves the full nested reference/citation structure
- Excel output normalizes nested fields, which may lose some hierarchical detail
- A summary report is printed with paper counts and file locations
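The Excel flattening can be sketched as a per-paper row builder; the field choices below are illustrative, and the resulting rows would typically be written out with something like `pandas.DataFrame(rows).to_excel(...)`:

```python
def flatten_paper(paper):
    """Flatten one nested paper record into a flat row for tabular export.

    Nested references/citations are reduced to counts here, which is
    the hierarchical detail the Excel output loses relative to the JSON.
    """
    return {
        "paperId": paper.get("paperId"),
        "title": paper.get("title"),
        "year": paper.get("year"),
        "venue": paper.get("venue"),
        "authors": "; ".join(a.get("name", "") for a in paper.get("authors") or []),
        "tldr": (paper.get("tldr") or {}).get("text"),
        "url": paper.get("url"),
        "num_references": len(paper.get("references") or []),
        "num_citations": len(paper.get("citations") or []),
    }
```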