
Implementation:Mbzuai oryx Awesome LLM Post training Collection Config Variables

From Leeroopedia


Knowledge Sources
Domains: Data_Collection, Configuration
Last Updated: 2026-02-08 07:30 GMT

Overview

Concrete configuration pattern for establishing global collection parameters in the deep paper collection script.

Description

The module-level configuration block in deep_collection_sementic.py defines six global variables that govern the entire paper crawling pipeline. These include the output directory path, a deduplication dictionary, a paper count cap, a running counter, a per-paper reference/citation limit, and an API rate-limit wait time. The block also eagerly creates the output directory using os.makedirs.

Usage

Set these variables before running the paper collection pipeline. Adjust max_papers to control corpus size, max_ref_citations to control breadth of reference/citation crawling, and rate_limit_wait to comply with Semantic Scholar API rate limits.

Code Reference

Source Location

Signature

# Module-level configuration variables
PDF_FOLDER = "Downloaded_Papers"          # str: output directory name
os.makedirs(PDF_FOLDER, exist_ok=True)    # creates directory eagerly

processed_papers = {}                      # dict: tracks processed paper IDs -> details
max_papers = 1000                          # int: cap on total papers to collect
paper_count = 0                            # int: running counter of collected papers
max_ref_citations = 200                    # int: max references/citations per paper
rate_limit_wait = 10                       # int: seconds to wait on HTTP 429

Import

import os

I/O Contract

Inputs

Name Type Required Description
(hardcoded values) Various Yes All values are defined as literals at module level; no external input required

Outputs

Name Type Description
PDF_FOLDER str Directory name for downloaded papers (created on disk)
processed_papers dict Empty dict used for deduplication across recursive fetches
max_papers int Upper bound on total papers the pipeline will collect
paper_count int Running counter (starts at 0), incremented by fetch_paper_details
max_ref_citations int Maximum references and citations fetched per individual paper
rate_limit_wait int Seconds to sleep when Semantic Scholar returns HTTP 429
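A sketch of how rate_limit_wait could be honored on HTTP 429, assuming a simple retry wrapper (the wrapper name and its injectable `do_request`/`sleep` parameters are illustrative, not the script's actual API):

```python
import time

rate_limit_wait = 10  # seconds to wait on HTTP 429

def request_with_rate_limit(do_request, wait=rate_limit_wait,
                            max_retries=3, sleep=time.sleep):
    """Hypothetical wrapper: retry a request, sleeping `wait` seconds
    whenever it reports HTTP 429, up to `max_retries` attempts."""
    for _ in range(max_retries):
        status, payload = do_request()
        if status == 429:
            sleep(wait)  # back off before retrying
            continue
        return payload
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

Injecting the sleep function makes the backoff behavior easy to unit-test without real delays.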

Usage Examples

Default Configuration

import os

# Default configuration for deep paper collection
PDF_FOLDER = "Downloaded_Papers"
os.makedirs(PDF_FOLDER, exist_ok=True)

processed_papers = {}
max_papers = 1000
paper_count = 0
max_ref_citations = 200
rate_limit_wait = 10

Custom Configuration for Smaller Collection

import os

# Smaller collection with conservative rate limiting
PDF_FOLDER = "Small_Collection"
os.makedirs(PDF_FOLDER, exist_ok=True)

processed_papers = {}
max_papers = 100          # Collect only 100 papers
paper_count = 0
max_ref_citations = 50    # Fewer references per paper
rate_limit_wait = 15      # More conservative wait time
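The max_ref_citations limit bounds crawl breadth per paper. A hedged sketch of how a fetched paper's link lists might be trimmed (the helper below is illustrative, not the script's actual function):

```python
max_ref_citations = 200  # max references/citations per paper

def trim_links(references, citations, limit=max_ref_citations):
    """Hypothetical helper: cap the references and citations
    crawled for a single paper to keep breadth bounded."""
    return references[:limit], citations[:limit]
```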

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
