Implementation: mbzuai-oryx/Awesome-LLM-Post-training Collection Config Variables
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Configuration |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete configuration pattern for establishing global collection parameters in the deep paper collection script.
Description
The module-level configuration block in deep_collection_sementic.py defines six global variables that govern the entire paper crawling pipeline. These include the output directory path, a deduplication dictionary, a paper count cap, a running counter, a per-paper reference/citation limit, and an API rate-limit wait time. The block also eagerly creates the output directory using os.makedirs.
Usage
Set these variables before running the paper collection pipeline. Adjust max_papers to control corpus size, max_ref_citations to control breadth of reference/citation crawling, and rate_limit_wait to comply with Semantic Scholar API rate limits.
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 8-17
Signature
```python
# Module-level configuration variables
PDF_FOLDER = "Downloaded_Papers"        # str: output directory name
os.makedirs(PDF_FOLDER, exist_ok=True)  # creates directory eagerly
processed_papers = {}                   # dict: processed paper IDs -> details
max_papers = 1000                       # int: cap on total papers to collect
paper_count = 0                         # int: running counter of collected papers
max_ref_citations = 200                 # int: max references/citations per paper
rate_limit_wait = 10                    # int: seconds to wait on HTTP 429
```
Import
```python
import os
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (hardcoded values) | Various | Yes | All values are defined as literals at module level; no external input required |
Outputs
| Name | Type | Description |
|---|---|---|
| PDF_FOLDER | str | Directory name for downloaded papers (created on disk) |
| processed_papers | dict | Empty dict used for deduplication across recursive fetches |
| max_papers | int | Upper bound on total papers the pipeline will collect |
| paper_count | int | Running counter (starts at 0), incremented by fetch_paper_details |
| max_ref_citations | int | Maximum references and citations fetched per individual paper |
| rate_limit_wait | int | Seconds to sleep when Semantic Scholar returns HTTP 429 |
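How rate_limit_wait is consumed can be sketched as a retry loop around a request callable. get_with_backoff and the (status, payload) fetch signature are stand-ins invented for this sketch, not names from the script:

```python
import time

rate_limit_wait = 10  # seconds to sleep when the API returns HTTP 429

def get_with_backoff(fetch, max_retries=5, wait=rate_limit_wait):
    """Call fetch() until it stops returning HTTP 429, sleeping between tries.

    fetch is any zero-argument callable returning (status_code, payload).
    Returns the payload on success, or None after max_retries rate-limited
    responses.
    """
    for _ in range(max_retries):
        status, payload = fetch()
        if status != 429:
            return payload
        time.sleep(wait)
    return None
```

Raising rate_limit_wait trades throughput for fewer rejected requests against the Semantic Scholar API.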
Usage Examples
Default Configuration
```python
import os

# Default configuration for deep paper collection
PDF_FOLDER = "Downloaded_Papers"
os.makedirs(PDF_FOLDER, exist_ok=True)
processed_papers = {}
max_papers = 1000
paper_count = 0
max_ref_citations = 200
rate_limit_wait = 10
```
Custom Configuration for Smaller Collection
```python
import os

# Smaller collection with conservative rate limiting
PDF_FOLDER = "Small_Collection"
os.makedirs(PDF_FOLDER, exist_ok=True)
processed_papers = {}
max_papers = 100          # Collect only 100 papers
paper_count = 0
max_ref_citations = 50    # Fewer references per paper
rate_limit_wait = 15      # More conservative wait time
```
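Because every value is a module-level literal, one possible variation is to let environment variables override the same defaults. This is a sketch only; the DEEP_COLLECT_* variable names are invented here and are not read by the actual script:

```python
import os

# Same defaults as the script, overridable via hypothetical env vars
PDF_FOLDER = os.environ.get("DEEP_COLLECT_PDF_FOLDER", "Downloaded_Papers")
max_papers = int(os.environ.get("DEEP_COLLECT_MAX_PAPERS", "1000"))
max_ref_citations = int(os.environ.get("DEEP_COLLECT_MAX_REF_CITATIONS", "200"))
rate_limit_wait = int(os.environ.get("DEEP_COLLECT_RATE_LIMIT_WAIT", "10"))
```

This keeps the defaults documented in one place while allowing per-run tuning without editing the source file.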