# Environment:Mbzuai oryx Awesome LLM Post training Python Requests
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Infrastructure |
| Last Updated | 2026-02-08 08:00 GMT |
## Overview
Python 3.x environment with `requests`, `json`, `time`, `tqdm`, and `pandas` for querying the Semantic Scholar API and processing paper metadata.
## Description
This environment provides the core runtime for the deep paper collection and research trend analysis scripts. It includes the `requests` library for HTTP GET calls to the Semantic Scholar Graph API, `time` for rate-limit backoff delays, `tqdm` for progress bars during recursive paper fetching, `json` for checkpoint serialization, and `pandas` for data normalization and Excel export. The `os` module handles directory creation and path management, and `future_research_data.py` additionally imports `matplotlib.pyplot` for trend plotting.
## Usage
Use this environment for any workflow that queries the Semantic Scholar API, including seed paper search (`search_papers`), recursive paper detail fetching (`fetch_paper_details`), publication count querying (`get_paper_count`), and JSON/Excel export operations. It is required for running `deep_collection_sementic.py` and `future_research_data.py`.
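The export side of that workflow can be sketched as follows (the helper names are illustrative, not the scripts' actual functions; `export_excel` assumes `openpyxl` is installed):

```python
import json

import pandas as pd


def save_checkpoint(papers: list, path: str = "checkpoint.json") -> None:
    """Serialize collected paper metadata so a long crawl can resume later."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(papers, f, ensure_ascii=False, indent=2)


def export_excel(papers: list, path: str = "papers.xlsx") -> pd.DataFrame:
    """Flatten nested paper records and write them to an Excel sheet."""
    df = pd.json_normalize(papers)
    df.to_excel(path, index=False)  # pandas delegates .xlsx writing to openpyxl
    return df
```

The checkpoint file is written with `encoding="utf-8"`, matching the encoding behavior noted under Compatibility Notes.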
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Any (Linux, macOS, Windows) | No OS-specific dependencies |
| Hardware | Standard CPU | No GPU required; network access required for API calls |
| Network | Internet access | Must reach api.semanticscholar.org on HTTPS (port 443) |
| Disk | 1GB free | For JSON checkpoint files and Excel output |
## Dependencies
### System Packages
- No system-level packages required beyond Python itself
### Python Packages
- `python` >= 3.6
- `requests` (any recent version)
- `pandas` (any recent version)
- `tqdm` (any recent version)
- `openpyxl` (required by pandas for Excel export)
- `matplotlib` (imported by `future_research_data.py` for trend plotting)
## Credentials
No API keys are required for the Semantic Scholar API at the basic rate tier. However, rate limits apply (see Common Errors).
Optional:
- `S2_API_KEY`: Semantic Scholar API key for higher rate limits. Not used in the current scripts but recommended for production use.
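If a key is obtained, Semantic Scholar accepts it in the `x-api-key` request header. A minimal sketch of wiring it in (the helper is hypothetical; the current scripts do not read `S2_API_KEY`):

```python
import os


def api_headers() -> dict:
    """Build request headers, attaching S2_API_KEY as x-api-key when set."""
    headers = {"User-Agent": "AcademicResearch/1.0 (mailto:user@example.com)"}
    api_key = os.environ.get("S2_API_KEY")
    if api_key:
        headers["x-api-key"] = api_key  # grants the higher authenticated rate limit
    return headers


# Usage: requests.get(url, headers=api_headers())
```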
## Quick Install

```shell
# Install all required packages
pip install requests pandas tqdm openpyxl matplotlib
```
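A quick post-install sanity check (the helper below is a hypothetical convenience, not part of the repository):

```python
import importlib.util

# Everything the collection scripts import from outside the standard library.
REQUIRED = ("requests", "pandas", "tqdm", "openpyxl", "matplotlib")


def missing_modules(names=REQUIRED):
    """Return the subset of module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("All set" if not missing else "Missing: " + ", ".join(missing))
```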
## Code Evidence
Import statements from `scripts/deep_collection_sementic.py:1-6`:
```python
import requests
import json
import time
import os
from tqdm import tqdm
import pandas as pd
```
Import statements from `scripts/future_research_data.py:1-6`:
```python
import os
import json
import requests
import time
import pandas as pd
import matplotlib.pyplot as plt
```
HTTP request pattern from `scripts/deep_collection_sementic.py:21-27`:
```python
url = f"https://api.semanticscholar.org/graph/v1/paper/search?query={query}&limit={limit}&fields=title,authors,abstract,url,tldr,year,venue,references,citations"
for _ in range(3):  # Retry up to 3 times if 429 error occurs
    response = requests.get(url)
    if response.status_code == 200:
        return response.json().get("data", [])
```
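The excerpt elides the 429 branch; below is a hedged reconstruction of the full retry pattern, using the 10-second backoff described under Common Errors (the `session` and `sleep` parameters are added here for testability and are not in the script):

```python
import time

import requests


def get_with_backoff(url: str, retries: int = 3, backoff: float = 10.0,
                     session=None, sleep=time.sleep):
    """GET a Semantic Scholar URL, sleeping on HTTP 429 before retrying."""
    http = session or requests
    for attempt in range(retries):
        response = http.get(url)
        if response.status_code == 200:
            return response.json().get("data", [])
        if response.status_code == 429 and attempt < retries - 1:
            sleep(backoff)  # fixed backoff between rate-limited attempts
    return []
```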
User-Agent header from `scripts/future_research_data.py:11`:
```python
headers = {'User-Agent': 'AcademicResearch/1.0 (mailto:user@example.com)'}
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'requests'` | requests not installed | `pip install requests` |
| `ModuleNotFoundError: No module named 'tqdm'` | tqdm not installed | `pip install tqdm` |
| `ModuleNotFoundError: No module named 'openpyxl'` | openpyxl not installed (needed by pandas for .xlsx) | `pip install openpyxl` |
| HTTP 429 Rate limit exceeded | Too many API requests to Semantic Scholar | Scripts have built-in retry logic with 10s backoff; consider adding an API key for higher limits |
| `ConnectionError` / `requests.exceptions.ConnectionError` | No internet access or API endpoint unreachable | Check network connectivity to api.semanticscholar.org |
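For the connection errors in the last row, a thin wrapper keeps a long crawl from dying on a transient outage (hypothetical helper, not in the scripts):

```python
import requests


def safe_get(url: str, headers=None, timeout: float = 30.0):
    """GET a URL, returning None instead of raising on network failures."""
    try:
        return requests.get(url, headers=headers, timeout=timeout)
    except requests.exceptions.ConnectionError as exc:
        print(f"Network unreachable for {url}: {exc}")
        return None
```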
## Compatibility Notes
- All platforms: Works on Linux, macOS, and Windows without modification.
- Python version: Requires Python 3.6+ for f-string support used throughout the scripts.
- Semantic Scholar API: Free tier has rate limits (approximately 100 requests per 5 minutes). The scripts include retry logic but no API key authentication. For heavy usage, obtain an API key from Semantic Scholar.
- Encoding: JSON files are written with `encoding="utf-8"` for international character support in paper titles and abstracts.
## Related Pages
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Collection_Config_Variables
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Search_Papers
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Fetch_Paper_Details
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Json_Dump_Checkpoint
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Json_Normalize_Excel_Export
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Json_Load_Corpus
- Implementation:Mbzuai_oryx_Awesome_LLM_Post_training_Get_Paper_Count