Implementation: search_papers (mbzuai-oryx/Awesome-LLM-Post-training)
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Information_Retrieval |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
A concrete tool for querying the Semantic Scholar Graph API to retrieve seed papers for corpus building.
Description
The search_papers function sends a keyword query to the Semantic Scholar /paper/search endpoint and returns a list of paper metadata dictionaries. It includes built-in retry logic for HTTP 429 rate-limit responses, retrying up to 3 times with a configurable wait interval. The function requests comprehensive metadata fields including title, authors, abstract, URL, TL;DR, year, venue, references, and citations.
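The retry behavior described above can be sketched as follows. This is a minimal reconstruction, not the repository's exact code: the Graph API v1 endpoint and field list are taken from the description, while the parameter names (`max_retries`, `wait`) and the injectable `get` callable (included so the sketch can be exercised without network access) are assumptions.

```python
import time
from typing import List, Optional

import requests

# Graph API search endpoint and metadata fields named in the description.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,authors,abstract,url,tldr,year,venue,references,citations"

def search_papers(query: str, limit: int = 5,
                  max_retries: int = 3, wait: float = 5.0,
                  get=requests.get) -> Optional[List[dict]]:
    """Sketch of the keyword search with simple HTTP 429 retry logic."""
    params = {"query": query, "limit": limit, "fields": FIELDS}
    for _ in range(max_retries):
        resp = get(SEARCH_URL, params=params)
        if resp.status_code == 429:   # rate limited: wait, then retry
            time.sleep(wait)
            continue
        if resp.status_code == 200:   # success: unwrap the 'data' field
            return resp.json().get("data", [])
        break                         # any other HTTP error: give up
    return None                       # failure after retries
```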
Usage
Import and call this function as the first step of the deep paper collection pipeline. It produces the seed set that feeds into recursive reference/citation crawling via fetch_paper_details.
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 20-34
Signature
```python
def search_papers(query: str, limit: int = 5) -> Optional[List[dict]]:
    """
    Search Semantic Scholar for papers matching query.

    Args:
        query: Search query string sent to Semantic Scholar API.
        limit: Maximum number of seed papers to retrieve (default 5).

    Returns:
        List of paper metadata dicts from Semantic Scholar 'data' field,
        or None on failure after retries.
    """
```
Import
```python
# Function defined in scripts/deep_collection_sementic.py
# Dependencies:
import requests
import time
from typing import List, Optional  # used in the signature's type hints
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | Search query string sent to Semantic Scholar API |
| limit | int | No | Maximum number of seed papers to retrieve (default 5) |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | Optional[List[dict]] | List of paper metadata dicts, each containing paperId, title, authors, abstract, url, tldr, year, venue, references, citations. Returns None on failure. |
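Several of the returned fields are nested rather than flat strings. The helper below (hypothetical, not part of the repository) shows the field shapes assumed from the Semantic Scholar Graph API: `authors` is a list of dicts with a `name` key, and `tldr`, when present, is a dict with a `text` key (it may also be `None`).

```python
def summarize(paper: dict) -> str:
    """Format one Semantic Scholar record for display (field shapes assumed)."""
    # "authors" is a list of dicts with a "name" key.
    names = ", ".join(a.get("name", "?") for a in paper.get("authors", []))
    # "tldr" may be None, so guard before indexing into it.
    tldr = (paper.get("tldr") or {}).get("text", "no TL;DR")
    return f"{paper.get('title')} ({paper.get('year')}): {names}. {tldr}"
```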
Usage Examples
Basic Seed Search
```python
# Search for papers on LLM post-training
query = "Survey on Large Language and Reinforcement Learning"
papers = search_papers(query, limit=1)
if papers:
    print(f"Found {len(papers)} seed papers")
    for paper in papers:
        print(f" - {paper.get('title')}")
        paper_id = paper.get("paperId")
        # Feed into recursive crawling
        details = fetch_paper_details(paper_id)
else:
    print("No papers found or API error")
```
Broader Seed Search
```python
# Retrieve more seed papers for wider coverage
papers = search_papers("reinforcement learning from human feedback", limit=5)
if papers:
    data = []
    for paper in papers:
        paper_id = paper.get("paperId")
        if paper_id:
            details = fetch_paper_details(paper_id)
            if details:
                data.append(details)
```