Workflow:Mbzuai oryx Awesome LLM Post training Research Trend Analysis
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Academic_Research, Data_Visualization |
| Last Updated | 2025-02-28 14:00 GMT |
Overview
End-to-end process for analyzing publication trends across research keywords: it queries Semantic Scholar for yearly paper counts and renders the results as bar charts.
Description
This workflow measures how research interest in specific LLM post-training topics has evolved over time. It reads a set of categorized research keywords from a CSV file, queries the Semantic Scholar API for the number of publications per year for each keyword, and produces both numerical data (JSON and Excel) and visual bar charts (PNG images). The output enables researchers to identify emerging trends, peak interest periods, and declining topics within the LLM post-training landscape.
Goal: A set of publication trend visualizations and structured data showing yearly paper counts per research keyword.
Scope: From a CSV of categorized keywords to per-keyword bar charts and a consolidated Excel workbook with one sheet per keyword.
Strategy: Uses the Semantic Scholar search API with year filters to obtain aggregate publication counts, applies rate-limit-aware querying with retries, and generates matplotlib bar charts with a value label on each bar.
Usage
Execute this workflow when you want to understand the temporal dynamics of research topics within LLM post-training. This is appropriate when you have a set of research keywords (e.g., "RLHF", "Direct Preference Optimization", "Monte Carlo Tree Search") and need quantitative evidence of how publication volume has changed year-over-year, typically for inclusion in a survey paper's introduction or related work section.
Execution Steps
Step 1: Load Research Keywords
Read a CSV file containing categorized research keywords. Each row has a category label and a specific keyword string. The CSV structure defines the scope of the trend analysis.
Key considerations:
- The CSV must have columns named "Category" and "Research Keyword"
- Keywords should be specific enough to return meaningful Semantic Scholar results
- Categories provide grouping for organizing the output
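A minimal sketch of this step, assuming the keyword file is an ordinary comma-separated file with a header row. The function name `load_keywords` and the category labels in the sample data are illustrative and not taken from the workflow's source.

```python
import csv
import io


def load_keywords(csv_file):
    """Return a list of (category, keyword) pairs from an open CSV file."""
    reader = csv.DictReader(csv_file)
    required = {"Category", "Research Keyword"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")

    pairs = []
    for row in reader:
        category = (row["Category"] or "").strip()
        keyword = (row["Research Keyword"] or "").strip()
        if keyword:  # skip rows with an empty keyword
            pairs.append((category, keyword))
    return pairs


# Illustrative rows only; a real run would open the keyword CSV from disk.
sample = io.StringIO(
    "Category,Research Keyword\n"
    "Alignment,RLHF\n"
    "Alignment,Direct Preference Optimization\n"
    "Reasoning,Monte Carlo Tree Search\n"
)
keywords = load_keywords(sample)
print(keywords)
```

Checking the column names up front makes a mislabeled CSV fail immediately rather than partway through the API queries.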
Step 2: Query Yearly Publication Counts
For each keyword, iterate over the target year range (e.g., 2020-2025) and query the Semantic Scholar API for the total number of matching papers published in each year. Each response carries a total-count field, and only that value is extracted; individual paper records are not downloaded.
Key considerations:
- A polite delay (1 second) is inserted between API calls to avoid rate limiting
- Retry logic handles HTTP 429 rate-limit responses with a 10-second backoff
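- Up to 10 retries are attempted per request; if all of them fail, the count for that year is recorded as zero, so a zero can indicate a failed request rather than an absence of publications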
- User-Agent headers identify the request as academic research
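The querying and retry behavior described above could be sketched as follows, using only the standard library. The endpoint URL, the `year`, `limit`, and `fields` query parameters, the `total` response field, and the User-Agent string are assumptions that should be checked against the current Semantic Scholar API documentation. The function names are hypothetical, and treating "10 retries" as 10 attempts in total is one possible reading of the description.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoint and header value; verify against the API documentation.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
HEADERS = {"User-Agent": "Academic research trend analysis (illustrative)"}


def http_get_json(url):
    """Fetch a URL and return (status_code, parsed_json_or_None)."""
    request = urllib.request.Request(url, headers=HEADERS)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as err:
        return err.code, None
    except urllib.error.URLError:
        return 0, None  # network-level failure, no HTTP status available


def count_papers(keyword, year, fetch=http_get_json, max_retries=10,
                 backoff=10.0, sleep=time.sleep):
    """Return the total number of matching papers for one keyword and year."""
    params = urllib.parse.urlencode(
        {"query": keyword, "year": str(year), "limit": 1, "fields": "title"}
    )
    url = f"{SEARCH_URL}?{params}"
    for _ in range(max_retries):
        status, payload = fetch(url)
        if status == 200 and payload is not None:
            return int(payload.get("total", 0))
        if status == 429:
            sleep(backoff)  # rate limited: wait, then try again
            continue
        break  # other failures are not retried in this sketch
    return 0  # every attempt failed; indistinguishable from a true zero


def yearly_counts(keyword, years, delay=1.0, fetch=http_get_json,
                  sleep=time.sleep):
    """Query each year in turn, pausing `delay` seconds between calls."""
    counts = {}
    for year in years:
        counts[year] = count_papers(keyword, year, fetch=fetch, sleep=sleep)
        sleep(delay)  # polite delay between API calls
    return counts
```

Passing `fetch` and `sleep` as parameters keeps the retry logic testable without network access; a real run would call `yearly_counts("RLHF", range(2020, 2026))` with the defaults.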
Step 3: Generate Trend Visualizations
For each keyword, create a bar chart showing papers published per year. Each chart prints the count above its bar, includes axis labels, and carries a title that incorporates the keyword and its category. Each chart is saved as a PNG image.
Key considerations:
- Bar labels display comma-formatted counts for readability
- Figure dimensions are set for presentation quality (12x6 inches)
- Grid lines on the y-axis aid visual comparison across years
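One way these charts might be drawn with matplotlib is sketched below. The function names, axis label text, title format, grid styling, and DPI are assumptions; only the 12x6 figure size, comma-formatted bar labels, y-axis grid, and PNG output come from the description above.

```python
def format_count(count):
    """Format an integer with thousands separators for use as a bar label."""
    return f"{count:,}"


def plot_keyword_trend(keyword, category, counts, out_path):
    """Save a bar chart of papers per year for one keyword as a PNG file.

    `counts` maps each year to its paper count.
    """
    # Imported here so format_count stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render straight to files; no display required
    import matplotlib.pyplot as plt

    ordered_years = sorted(counts)
    labels = [str(year) for year in ordered_years]
    values = [counts[year] for year in ordered_years]

    fig, ax = plt.subplots(figsize=(12, 6))
    bars = ax.bar(labels, values)
    for bar, value in zip(bars, values):
        # Place the formatted count just above the top center of each bar.
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                format_count(value), ha="center", va="bottom")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of papers")
    ax.set_title(f"{keyword} ({category})")
    ax.grid(axis="y", linestyle="--", alpha=0.6)
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # free the figure when plotting many keywords in a loop
```

Closing each figure matters when the loop covers many keywords, since matplotlib otherwise keeps every figure in memory.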
Step 4: Progressively Save Structured Data
After processing each keyword, the cumulative results dictionary is saved to a JSON file. This ensures no data is lost if the process is interrupted midway through the keyword list.
Key considerations:
- Progressive saving writes after every keyword, not just at completion
- JSON structure nests yearly data under each keyword with its category
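Progressive saving could look like the sketch below. The key names `category` and `yearly_counts`, the function names, and the `count_fn` callback are hypothetical. Writing through a temporary file and renaming it is an addition beyond what the workflow describes, included so that an interruption during a write cannot leave a truncated JSON file.

```python
import json
import os
import tempfile


def save_results(results, path):
    """Write the whole results dict to `path`, replacing any earlier file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it over
    # the target so readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
    os.replace(tmp_path, path)


def process_keywords(pairs, count_fn, years, path):
    """Collect yearly counts per keyword, saving after every keyword."""
    results = {}
    for category, keyword in pairs:
        results[keyword] = {
            "category": category,
            "yearly_counts": {
                str(year): count_fn(keyword, year) for year in years
            },
        }
        save_results(results, path)  # cumulative save, not just at the end
    return results
```

Because the full dictionary is rewritten each time, the file on disk always holds every keyword completed so far.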
Step 5: Export Consolidated Results
After all keywords are processed, export the complete dataset to an Excel workbook. Each keyword gets its own worksheet with year and paper count columns, enabling easy filtering and comparison.
Key considerations:
- Excel sheet names are truncated to 31 characters (Excel limit)
- The openpyxl engine is used for Excel generation
- Both JSON and Excel outputs are stored in a dedicated results directory
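The export might be sketched as follows, assuming each results entry holds a `yearly_counts` mapping (a hypothetical key name) and that pandas and openpyxl are installed. The workflow only states that sheet names are truncated to 31 characters. Replacing characters Excel rejects and disambiguating names that collide after truncation are additions in this sketch, and the column headers are illustrative.

```python
import re

MAX_SHEET_NAME = 31  # Excel's limit on worksheet name length


def sheet_name_for(keyword, used):
    """Return a unique, Excel-safe sheet name and record it in `used`."""
    # Excel rejects these characters in sheet names: \ / ? * [ ] :
    cleaned = re.sub(r"[\\/?*\[\]:]", "_", keyword).strip() or "Sheet"
    name = cleaned[:MAX_SHEET_NAME]
    suffix = 1
    while name.lower() in used:
        # Two long keywords can share the same first 31 characters.
        tag = f"_{suffix}"
        name = cleaned[:MAX_SHEET_NAME - len(tag)] + tag
        suffix += 1
    used.add(name.lower())  # Excel compares sheet names case-insensitively
    return name


def export_to_excel(results, path):
    """Write one worksheet per keyword with year and paper-count columns."""
    import pandas as pd  # imported here; only the export itself needs it

    used = set()
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for keyword, entry in results.items():
            yearly = entry["yearly_counts"]
            frame = pd.DataFrame(
                {"Year": list(yearly.keys()),
                 "Paper Count": list(yearly.values())}
            )
            frame.to_excel(writer, index=False,
                           sheet_name=sheet_name_for(keyword, used))
```

Plain truncation alone would make two keywords with the same first 31 characters compete for a single sheet name, which is why the helper appends a numeric suffix on collision.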