Implementation:Mbzuai oryx Awesome LLM Post training Get Paper Count
| Knowledge Sources | |
|---|---|
| Domains | Bibliometrics, Trend_Analysis |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete tool for querying yearly publication counts from the Semantic Scholar API for research trend analysis.
Description
The get_paper_count function queries the Semantic Scholar /paper/search endpoint with a keyword and a year filter. It requests a single result (limit=1) to minimize data transfer, since only the total match count in the response is needed. On an HTTP 429 rate-limit response it retries up to 10 times, sleeping 10 seconds between attempts. A custom User-Agent header identifies the request as academic research.
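The behavior described above can be sketched roughly as follows. This is an illustrative re-creation, not the repository code: the full base URL, the `fields` parameter, the User-Agent string, the 30-second timeout, and the injectable `http_get` and `sleep` parameters are all assumptions added so the sketch can be exercised without network access. It also assumes "up to 10 retries" means 10 attempts in total.

```python
import time

# Assumed full URL of the public Semantic Scholar Graph API search endpoint;
# the source only names the /paper/search path.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def get_paper_count_sketch(query, year, http_get=None, sleep=time.sleep,
                           max_attempts=10, retry_wait=10):
    """Illustrative sketch of the described behavior (hypothetical helper)."""
    if http_get is None:
        import requests  # imported lazily so a stub can replace it in tests
        http_get = requests.get

    # limit=1 keeps the payload small; only the total count is of interest.
    params = {"query": query, "year": str(year), "limit": 1, "fields": "title"}
    headers = {"User-Agent": "academic-research-script/0.1"}  # hypothetical value

    for _ in range(max_attempts):
        try:
            response = http_get(SEARCH_URL, params=params,
                                headers=headers, timeout=30)
        except Exception:
            return 0  # network failure is reported as 0
        if response.status_code == 200:
            return int(response.json().get("total", 0))
        if response.status_code == 429:
            sleep(retry_wait)  # rate limited: wait, then try again
            continue
        return 0  # any other status is treated as an error
    return 0  # all attempts were rate limited
```

Because 0 is returned both for errors and for genuinely empty results, a caller that needs to tell the two apart would have to extend the sketch, for example by returning None on failure.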
Usage
Call this function once for each keyword-year combination in the trend analysis. The calls sit inside a nested loop: the outer loop iterates over keywords (read from a CSV file) and the inner loop over years (2020-2025). The caller, not the function itself, adds a 1-second delay between calls to be polite to the API.
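As a minimal sketch of the keyword-loading side of that loop, the snippet below reads keywords from a CSV and walks the keyword-year grid. The column name `keyword`, the helper names `load_keywords` and `collect_counts`, and the stand-in counter are hypothetical; the source does not specify the CSV layout. The demo uses an in-memory CSV and a fake counting function so that nothing touches the network.

```python
import csv
import io
import time


def load_keywords(csv_file, column="keyword"):
    """Read non-empty keywords from one column of an open CSV file."""
    keywords = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            keywords.append(value)
    return keywords


def collect_counts(keywords, years, count_fn, delay=1.0, sleep=time.sleep):
    """Call count_fn(keyword, year) for every pair, pausing after each call."""
    results = {}
    for keyword in keywords:          # outer loop: keywords
        results[keyword] = {}
        for year in years:            # inner loop: years
            results[keyword][year] = count_fn(keyword, year)
            sleep(delay)              # polite delay applied by the caller
    return results


# Demo: in-memory CSV and a stand-in counter instead of get_paper_count.
sample_csv = io.StringIO("keyword\nRLHF\nDirect Preference Optimization\n\n")
keywords = load_keywords(sample_csv)
demo = collect_counts(keywords, range(2020, 2026),
                      count_fn=lambda kw, yr: len(kw) + yr % 10,
                      sleep=lambda seconds: None)
print(demo["RLHF"])
```

In real use, `count_fn` would be get_paper_count and the default `sleep` would be left in place so the 1-second delay actually applies.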
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/future_research_data.py
- Lines: 8-24
Signature
def get_paper_count(query: str, year: int) -> int:
    """
    Get number of papers for a given query and year from Semantic Scholar.

    Args:
        query: Research keyword to search for.
        year: Publication year filter.

    Returns:
        int: Total number of papers matching the query for that year.
            Returns 0 on error or retry exhaustion.
    """
Import
# Function defined in scripts/future_research_data.py
# Dependencies:
import requests
import time
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | Research keyword to search for |
| year | int | Yes | Publication year filter (e.g., 2023) |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | int | Total number of papers matching the query for the specified year. Returns 0 on error or when retries are exhausted. |
Usage Examples
Single Query
# Get paper count for a specific keyword and year
count = get_paper_count("reinforcement learning from human feedback", 2023)
print(f"RLHF papers in 2023: {count}")
Full Trend Analysis Loop
import time
keywords = ["RLHF", "Direct Preference Optimization", "MCTS for LLM"]
years = list(range(2020, 2026))
for keyword in keywords:
    counts = []
    for year in years:
        count = get_paper_count(keyword, year)
        counts.append(count)
        time.sleep(1)  # Polite delay between requests
    print(f"{keyword}: {dict(zip(years, counts))}")