Implementation: mbzuai-oryx/Awesome-LLM-Post-training pd.read_csv Keywords Loader
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Data_Ingestion |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
A concrete tool that loads categorized research keywords from a CSV file with pandas, supplying the query terms for research trend analysis.
Description
The pd.read_csv call in future_research_data.py loads a CSV file containing two columns: Category (research area grouping) and Research Keyword (specific query term). The resulting DataFrame is iterated row by row in the main processing loop, with each keyword driving a set of yearly API queries against Semantic Scholar.
Usage
Call this at the start of the research trend analysis pipeline. The CSV file must exist at the specified path and must contain the required columns. The loaded DataFrame drives all subsequent API queries.
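Because the script fails if the CSV is absent or malformed, a defensive loader can check both preconditions up front. The following is a minimal sketch; the `load_keywords` helper and its error messages are illustrative and not part of the original script, which calls `pd.read_csv` directly.

```python
import os
import pandas as pd

# Columns the downstream processing loop depends on
REQUIRED_COLUMNS = {"Category", "Research Keyword"}

def load_keywords(csv_path="assets/Keywords.csv"):
    """Load the keywords CSV, failing fast if the file or columns are missing.

    Hypothetical wrapper around the script's direct pd.read_csv call.
    """
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"Keywords CSV not found: {csv_path}")
    df = pd.read_csv(csv_path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
    return df
```

Failing fast here keeps a schema problem from surfacing mid-way through the API query loop, after some keywords have already been processed.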
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/future_research_data.py
- Lines: 27-28
Signature
```python
# Wrapper usage of pandas.read_csv
csv_path = "assets/Keywords.csv"
prompts_df = pd.read_csv(csv_path)
# prompts_df columns: ['Category', 'Research Keyword']
```
Import
```python
import pandas as pd
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| csv_path | str | Yes | Path to the keywords CSV file (hardcoded as "assets/Keywords.csv") |
Required CSV Schema:
| Column | Type | Description |
|---|---|---|
| Category | str | Research area grouping (e.g., "Reinforcement Learning", "NLP") |
| Research Keyword | str | Specific query term for Semantic Scholar search |
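To make the schema concrete, the snippet below parses an in-memory CSV with the same two columns. The sample rows are hypothetical; the actual contents of `assets/Keywords.csv` in the repository may differ.

```python
import io
import pandas as pd

# Hypothetical rows illustrating the required two-column schema;
# the real assets/Keywords.csv may contain different categories and keywords.
sample_csv = """Category,Research Keyword
Reinforcement Learning,reward modeling
Reinforcement Learning,policy optimization
NLP,instruction tuning
"""

df = pd.read_csv(io.StringIO(sample_csv))
# One row per category-keyword pair, exactly as the main loop expects
print(df.shape)  # (3, 2)
```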
Outputs
| Name | Type | Description |
|---|---|---|
| prompts_df | pandas.DataFrame | DataFrame with rows of category-keyword pairs, iterated in the main loop |
Usage Examples
Loading Keywords for Trend Analysis
```python
import pandas as pd

# Load research keywords from CSV
csv_path = "assets/Keywords.csv"
prompts_df = pd.read_csv(csv_path)

# Iterate over category-keyword pairs
for index, row in prompts_df.iterrows():
    category = row['Category']
    keyword = row['Research Keyword']
    print(f"Processing: '{keyword}' in '{category}'")
    # ... query API for each keyword
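Each keyword then drives a set of yearly Semantic Scholar queries. A minimal sketch of the URL construction is shown below, assuming the public Graph API paper-search endpoint; the actual endpoint, parameters, and year range used by `future_research_data.py` are assumptions, and the network call itself is omitted.

```python
from urllib.parse import urlencode

# Semantic Scholar Graph API paper-search endpoint (public, documented)
BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_query_url(keyword, year):
    """Return a search URL for one keyword restricted to one publication year.

    Hypothetical helper; the original script's request logic is not shown here.
    """
    params = urlencode({"query": keyword, "year": year, "limit": 1})
    return f"{BASE_URL}?{params}"

# One URL per year for a single keyword (year range is illustrative)
urls = [build_query_url("reward modeling", y) for y in range(2020, 2023)]
```

Separating URL construction from the request itself makes the yearly loop easy to test without hitting the API.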