Implementation: mbzuai-oryx/Awesome-LLM-Post-training Collection Config Variables
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Configuration |
| Last Updated | 2026-02-08 07:30 GMT |
Overview
Concrete configuration pattern for establishing global collection parameters in the deep paper collection script.
Description
The module-level configuration block in deep_collection_sementic.py defines six global variables that govern the entire paper crawling pipeline. These include the output directory path, a deduplication dictionary, a paper count cap, a running counter, a per-paper reference/citation limit, and an API rate-limit wait time. The block also eagerly creates the output directory using os.makedirs.
Usage
Set these variables before running the paper collection pipeline. Adjust max_papers to control corpus size, max_ref_citations to control breadth of reference/citation crawling, and rate_limit_wait to comply with Semantic Scholar API rate limits.
Code Reference
Source Location
- Repository: Awesome-LLM-Post-training
- File: scripts/deep_collection_sementic.py
- Lines: 8-17
Signature
```python
# Module-level configuration variables
PDF_FOLDER = "Downloaded_Papers"        # str: output directory name
os.makedirs(PDF_FOLDER, exist_ok=True)  # creates directory eagerly
processed_papers = {}                   # dict: processed paper IDs -> details
max_papers = 1000                       # int: cap on total papers to collect
paper_count = 0                         # int: running counter of collected papers
max_ref_citations = 200                 # int: max references/citations per paper
rate_limit_wait = 10                    # int: seconds to wait on HTTP 429
```
Import
```python
import os
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (hardcoded values) | Various | Yes | All values are defined as literals at module level; no external input required |
Outputs
| Name | Type | Description |
|---|---|---|
| PDF_FOLDER | str | Directory name for downloaded papers (created on disk) |
| processed_papers | dict | Empty dict used for deduplication across recursive fetches |
| max_papers | int | Upper bound on total papers the pipeline will collect |
| paper_count | int | Running counter (starts at 0), incremented by fetch_paper_details |
| max_ref_citations | int | Maximum references and citations fetched per individual paper |
| rate_limit_wait | int | Seconds to sleep when Semantic Scholar returns HTTP 429 |
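How rate_limit_wait is consumed can be sketched as a retry loop around a request callable. get_with_backoff and the (status, payload) fetch signature are stand-ins invented for this sketch, not names from the script:

```python
import time

rate_limit_wait = 10  # seconds to sleep when the API returns HTTP 429

def get_with_backoff(fetch, max_retries=5, wait=rate_limit_wait):
    """Call fetch() until it stops returning HTTP 429, sleeping between tries.

    fetch is any zero-argument callable returning (status_code, payload).
    Returns the payload on success, or None after max_retries rate-limited
    responses.
    """
    for _ in range(max_retries):
        status, payload = fetch()
        if status != 429:
            return payload
        time.sleep(wait)
    return None
```

Raising rate_limit_wait trades throughput for fewer rejected requests against the Semantic Scholar API.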
Usage Examples
Default Configuration
```python
import os

# Default configuration for deep paper collection
PDF_FOLDER = "Downloaded_Papers"
os.makedirs(PDF_FOLDER, exist_ok=True)
processed_papers = {}
max_papers = 1000
paper_count = 0
max_ref_citations = 200
rate_limit_wait = 10
```
Custom Configuration for Smaller Collection
```python
import os

# Smaller collection with conservative rate limiting
PDF_FOLDER = "Small_Collection"
os.makedirs(PDF_FOLDER, exist_ok=True)
processed_papers = {}
max_papers = 100          # Collect only 100 papers
paper_count = 0
max_ref_citations = 50    # Fewer references per paper
rate_limit_wait = 15      # More conservative wait time
```
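Because every value is a module-level literal, one possible variation is to let environment variables override the same defaults. This is a sketch only; the DEEP_COLLECT_* variable names are invented here and are not read by the actual script:

```python
import os

# Same defaults as the script, overridable via hypothetical env vars
PDF_FOLDER = os.environ.get("DEEP_COLLECT_PDF_FOLDER", "Downloaded_Papers")
max_papers = int(os.environ.get("DEEP_COLLECT_MAX_PAPERS", "1000"))
max_ref_citations = int(os.environ.get("DEEP_COLLECT_MAX_REF_CITATIONS", "200"))
rate_limit_wait = int(os.environ.get("DEEP_COLLECT_RATE_LIMIT_WAIT", "10"))
```

This keeps the defaults documented in one place while allowing per-run tuning without editing the source file.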