Workflow:Mbzuai oryx Awesome LLM Post training Research Trend Analysis

Knowledge Sources	Awesome-LLM-Post-training, Semantic Scholar API, LLM Post-Training Survey
Domains	Data_Engineering, Academic_Research, Data_Visualization
Last Updated	2025-02-28 14:00 GMT

Overview

End-to-end process for analyzing publication trends across research keywords: query yearly paper counts from Semantic Scholar, then render the counts as bar charts.

Description

This workflow measures how research interest in specific LLM post-training topics has evolved over time. It reads a set of categorized research keywords from a CSV file, queries the Semantic Scholar API for the number of publications per year for each keyword, and produces both numerical data (JSON and Excel) and visual bar charts (PNG images). The output enables researchers to identify emerging trends, peak interest periods, and declining topics within the LLM post-training landscape.

Goal: A set of publication trend visualizations and structured data showing yearly paper counts per research keyword.

Scope: From a CSV of categorized keywords to per-keyword bar charts and a consolidated Excel workbook with one sheet per keyword.

Strategy: Uses the Semantic Scholar search API with year filters to obtain aggregate publication counts, applies rate-limit-aware querying with retries, and generates matplotlib-based visualizations with labeled data points.

Usage

Execute this workflow when you want to understand the temporal dynamics of research topics within LLM post-training. This is appropriate when you have a set of research keywords (e.g., "RLHF", "Direct Preference Optimization", "Monte Carlo Tree Search") and need quantitative evidence of how publication volume has changed year-over-year, typically for inclusion in a survey paper's introduction or related work section.

Execution Steps

Step 1: Load Research Keywords

Read a CSV file containing categorized research keywords. Each row has a category label and a specific keyword string. The CSV structure defines the scope of the trend analysis.

Key considerations:

The CSV must have columns named "Category" and "Research Keyword"
Keywords should be specific enough to return meaningful Semantic Scholar results
Categories provide grouping for organizing the output
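
A minimal sketch of the loading step, assuming a UTF-8 CSV with the two column names listed above. It uses only the standard library `csv` module; the function name `load_keywords` and the choice to skip rows with an empty keyword are illustrative assumptions, not taken from the workflow's code.

```python
import csv

REQUIRED_COLUMNS = ("Category", "Research Keyword")


def load_keywords(csv_path):
    """Return a list of (category, keyword) pairs read from the keyword CSV."""
    # "utf-8-sig" also accepts files that start with a byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"CSV is missing required columns: {missing}")

        pairs = []
        for row in reader:
            category = (row["Category"] or "").strip()
            keyword = (row["Research Keyword"] or "").strip()
            if keyword:  # assumption: rows without a keyword are skipped
                pairs.append((category, keyword))
    return pairs
```

Failing early on a missing column keeps a malformed CSV from surfacing later as an empty or partial trend analysis.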

Step 2: Query Yearly Publication Counts

For each keyword, iterate over the target year range (e.g., 2020-2025) and query the Semantic Scholar API for the total number of matching papers published in each year. The API returns a total count field that is extracted without downloading individual paper records.

Key considerations:

A polite delay (1 second) is inserted between API calls to avoid rate limiting
Retry logic handles HTTP 429 rate-limit responses with a 10-second backoff
Up to 10 retries are attempted per request before returning a zero count
User-Agent headers identify the request as academic research
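
The querying behavior described above could look roughly like the following sketch. It assumes the public Semantic Scholar Graph API paper search endpoint and reads the `total` field of the response. The use of `urllib`, the User-Agent string, and the function names are illustrative assumptions. "Up to 10 retries" is interpreted here as one initial attempt plus at most 10 retries, and HTTP errors other than 429 are re-raised; the workflow's own code may handle both differently.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
# Hypothetical header value; the workflow only states that the request
# identifies itself as academic research.
HEADERS = {"User-Agent": "academic-research-trend-analysis/0.1"}


def fetch_total(keyword, year):
    """Ask the API how many papers match `keyword` in `year`."""
    # limit=1 keeps the payload small; only the total count is needed.
    params = urllib.parse.urlencode(
        {"query": keyword, "year": str(year), "limit": 1, "fields": "title"}
    )
    request = urllib.request.Request(f"{API_URL}?{params}", headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return int(payload.get("total", 0))


def count_with_retries(keyword, year, fetch=fetch_total,
                       max_retries=10, backoff=10.0, sleep=time.sleep):
    """Return the yearly count, retrying on HTTP 429; 0 if retries run out."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(keyword, year)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # assumption: only rate limiting is retried
            if attempt < max_retries:
                sleep(backoff)
    return 0


def yearly_counts(keyword, years, delay=1.0,
                  count=count_with_retries, sleep=time.sleep):
    """Collect {year: count} for one keyword with a polite delay per call."""
    counts = {}
    for year in years:
        counts[year] = count(keyword, year)
        sleep(delay)
    return counts
```

A call such as `yearly_counts("RLHF", range(2020, 2026))` would cover 2020 through 2025. Because exhausted retries yield 0, a zero in the output can mean either "no papers" or "request failed", which is worth keeping in mind when reading the charts.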

Step 3: Generate Trend Visualizations

For each keyword, create a bar chart showing papers published per year. Charts include labeled data values on top of each bar, axis labels, and a title incorporating the keyword and its category. Each chart is saved as a PNG image.

Key considerations:

Bar labels display comma-formatted counts for readability
Figure dimensions are set for presentation quality (12x6 inches)
Grid lines on the y-axis aid visual comparison across years
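
As an illustration of the chart described above, here is a sketch using matplotlib. The 12x6 figure size, comma-formatted bar labels, and y-axis grid come from the description; the title wording, axis label text, file-naming scheme, and `dpi` value are assumptions made for the example.

```python
import os
import re


def format_count(count):
    """Comma-format a count for use as a bar label, e.g. 12345 -> '12,345'."""
    return f"{count:,}"


def chart_filename(keyword):
    """Hypothetical naming scheme: keyword reduced to a filesystem-safe slug."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", keyword).strip("_")
    return f"{slug}.png"


def plot_keyword_trend(keyword, category, counts, out_dir="."):
    """Save a bar chart of papers per year and return the PNG path."""
    # Imported here so the helpers above work even without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to files; no display required
    import matplotlib.pyplot as plt

    years = sorted(counts)
    values = [counts[year] for year in years]

    fig, ax = plt.subplots(figsize=(12, 6))
    bars = ax.bar([str(year) for year in years], values)

    # Put the comma-formatted count on top of each bar.
    for bar, value in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                format_count(value), ha="center", va="bottom")

    ax.set_xlabel("Year")
    ax.set_ylabel("Number of papers")
    ax.set_title(f"{keyword} ({category}): papers published per year")
    ax.grid(axis="y", linestyle="--", alpha=0.7)

    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, chart_filename(keyword))
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # free the figure when looping over many keywords
    return path
```

Closing each figure matters when the keyword list is long, since matplotlib otherwise keeps every figure in memory.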

Step 4: Progressively Save Structured Data

After processing each keyword, the cumulative results dictionary is saved to a JSON file. This ensures no data is lost if the process is interrupted midway through the keyword list.

Key considerations:

Progressive saving writes after every keyword, not just at completion
JSON structure nests yearly data under each keyword with its category
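
The progressive save can be sketched as below. The key names `category` and `yearly_counts`, the function names, and the `count_for` callback are hypothetical. Writing to a temporary file and renaming it is an extra precaution added for this example so that an interruption during the write itself cannot leave a truncated JSON file; the workflow may simply overwrite the file.

```python
import json
import os


def save_results(results, path):
    """Write the cumulative results to `path`, replacing any earlier copy."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
    os.replace(tmp_path, path)  # swap in the new file in one step


def collect_and_save(keywords, years, count_for, path):
    """Process (category, keyword) pairs, saving after every keyword."""
    results = {}
    for category, keyword in keywords:
        results[keyword] = {
            "category": category,
            # JSON object keys are strings, so years are stored as text.
            "yearly_counts": {str(y): count_for(keyword, y) for y in years},
        }
        save_results(results, path)  # progressive save, not only at the end
    return results
```

If the run stops while processing the third keyword, the file on disk still holds complete entries for the first two.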

Step 5: Export Consolidated Results

After all keywords are processed, export the complete dataset to an Excel workbook. Each keyword gets its own worksheet with year and paper count columns, enabling easy filtering and comparison.

Key considerations:

Excel sheet names are truncated to 31 characters (Excel limit)
The openpyxl engine is used for Excel generation
Both JSON and Excel outputs are stored in a dedicated results directory
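
A sketch of the export step, assuming pandas with the openpyxl engine and a results dictionary shaped as `{keyword: {"category": ..., "yearly_counts": {year: count}}}`. That shape, the column headers, and the function names are assumptions for illustration. Replacing characters that Excel rejects in sheet names is an addition beyond the truncation the workflow describes.

```python
import re

EXCEL_SHEET_NAME_LIMIT = 31  # Excel rejects longer worksheet names


def sheet_name(keyword):
    """Make a keyword usable as a worksheet name."""
    # Excel does not allow [ ] : * ? / \ in sheet names (added precaution).
    cleaned = re.sub(r"[\[\]:*?/\\]", "_", keyword)
    return cleaned[:EXCEL_SHEET_NAME_LIMIT]


def export_to_excel(results, path):
    """Write one worksheet per keyword with year and paper count columns."""
    # Imported here so sheet_name() can be reused without pandas installed.
    import pandas as pd

    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for keyword, entry in results.items():
            rows = sorted(entry["yearly_counts"].items())
            frame = pd.DataFrame(rows, columns=["Year", "Paper Count"])
            frame.to_excel(writer, sheet_name=sheet_name(keyword), index=False)
```

Two keywords that share their first 31 characters would map to the same sheet name under this scheme, so long keyword lists may need a suffix or a uniqueness check.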

GitHub URL

Workflow Repository