Implementation: search_papers (mbzuai-oryx/Awesome-LLM-Post-training)
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Information_Retrieval |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
A concrete tool for querying the Semantic Scholar Graph API to retrieve seed papers for corpus building.
Description
The search_papers function sends a keyword query to the Semantic Scholar /paper/search endpoint and returns a list of paper metadata dictionaries. It includes built-in retry logic for HTTP 429 rate-limit responses, retrying up to 3 times with a configurable wait interval. The function requests comprehensive metadata fields including title, authors, abstract, URL, TL;DR, year, venue, references, and citations.
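The retry behavior described above can be sketched as follows. This is a minimal reconstruction, not the repository's exact code: the Graph API v1 endpoint and field list are taken from the description, while the parameter names (`max_retries`, `wait`) and the injectable `get` callable (included so the sketch can be exercised without network access) are assumptions.

```python
import time
from typing import List, Optional

import requests

# Graph API search endpoint and metadata fields named in the description.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,authors,abstract,url,tldr,year,venue,references,citations"

def search_papers(query: str, limit: int = 5,
                  max_retries: int = 3, wait: float = 5.0,
                  get=requests.get) -> Optional[List[dict]]:
    """Sketch of the keyword search with simple HTTP 429 retry logic."""
    params = {"query": query, "limit": limit, "fields": FIELDS}
    for _ in range(max_retries):
        resp = get(SEARCH_URL, params=params)
        if resp.status_code == 429:   # rate limited: wait, then retry
            time.sleep(wait)
            continue
        if resp.status_code == 200:   # success: unwrap the 'data' field
            return resp.json().get("data", [])
        break                         # any other HTTP error: give up
    return None                       # failure after retries
```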
Usage
Import and call this function as the first step of the deep paper collection pipeline. It produces the seed set that feeds into recursive reference/citation crawling via fetch_paper_details.
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 20-34
Signature
```python
def search_papers(query: str, limit: int = 5) -> Optional[List[dict]]:
    """
    Search Semantic Scholar for papers matching query.

    Args:
        query: Search query string sent to Semantic Scholar API.
        limit: Maximum number of seed papers to retrieve (default 5).

    Returns:
        List of paper metadata dicts from Semantic Scholar 'data' field,
        or None on failure after retries.
    """
```
Import
```python
# Function defined in scripts/deep_collection_sementic.py
# Dependencies:
import requests
import time
from typing import List, Optional  # used in the signature's type hints
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | Search query string sent to Semantic Scholar API |
| limit | int | No | Maximum number of seed papers to retrieve (default 5) |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | Optional[List[dict]] | List of paper metadata dicts, each containing paperId, title, authors, abstract, url, tldr, year, venue, references, citations. Returns None on failure. |
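Several of the returned fields are nested rather than flat strings. The helper below (hypothetical, not part of the repository) shows the field shapes assumed from the Semantic Scholar Graph API: `authors` is a list of dicts with a `name` key, and `tldr`, when present, is a dict with a `text` key (it may also be `None`).

```python
def summarize(paper: dict) -> str:
    """Format one Semantic Scholar record for display (field shapes assumed)."""
    # "authors" is a list of dicts with a "name" key.
    names = ", ".join(a.get("name", "?") for a in paper.get("authors", []))
    # "tldr" may be None, so guard before indexing into it.
    tldr = (paper.get("tldr") or {}).get("text", "no TL;DR")
    return f"{paper.get('title')} ({paper.get('year')}): {names}. {tldr}"
```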
Usage Examples
Basic Seed Search
```python
# Search for papers on LLM post-training
query = "Survey on Large Language and Reinforcement Learning"
papers = search_papers(query, limit=1)
if papers:
    print(f"Found {len(papers)} seed papers")
    for paper in papers:
        print(f" - {paper.get('title')}")
        paper_id = paper.get("paperId")
        # Feed into recursive crawling
        details = fetch_paper_details(paper_id)
else:
    print("No papers found or API error")
```
Broader Seed Search
```python
# Retrieve more seed papers for wider coverage
papers = search_papers("reinforcement learning from human feedback", limit=5)
if papers:
    data = []
    for paper in papers:
        paper_id = paper.get("paperId")
        if paper_id:
            details = fetch_paper_details(paper_id)
            if details:
                data.append(details)
```