Heuristic:Mbzuai oryx Awesome LLM Post training Reference Citation Cap 200

From Leeroopedia




Knowledge Sources
Domains Data_Collection, Optimization
Last Updated 2026-02-08 08:00 GMT

Overview

Per-paper cap of 200 references and 200 citations to control breadth during recursive citation graph traversal.

Description

Highly cited papers can have thousands of references and citations. Fetching all of them at each recursion level would cause exponential growth in API calls and corpus size. The script caps the number of references and citations fetched per paper to 200 each via the max_ref_citations global variable and Python list slicing. This creates an upper bound on the branching factor of the citation graph traversal.
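A small illustration of why slicing works as a cap: a Python slice never raises when the list is shorter than the bound, so typical papers pass through untouched and only oversized lists are truncated. The ID lists below are made up for the example; only `max_ref_citations` and its value come from the script.

```python
max_ref_citations = 200  # same name and value as the script's global

# Made-up ID lists standing in for API results (not real data).
typical_refs = [f"paper-{i}" for i in range(40)]     # a typical reference list
survey_cites = [f"paper-{i}" for i in range(5000)]   # a heavily cited paper

capped_typical = typical_refs[:max_ref_citations]
capped_survey = survey_cites[:max_ref_citations]

print(len(capped_typical))  # 40: shorter than the cap, nothing is dropped
print(len(capped_survey))   # 200: truncated to the cap
```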

Usage

Apply this heuristic when crawling citation graphs or any recursive structure with high fanout. The cap of 200 is generous enough to capture most references for typical papers (which average 30-50 references) while protecting against outlier papers with thousands of citations.
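As a rough sketch of the same idea outside citation graphs, the traversal below caps the fanout of any node in a generic graph. It is not the script's code: the function name `capped_crawl`, its parameters, and the breadth-first queue are assumptions made for illustration (the script is described as recursive), and `neighbors` is a hypothetical callback that returns the adjacent nodes of a given node.

```python
from collections import deque
from itertools import islice


def capped_crawl(seed, neighbors, max_fanout=200, max_depth=2, max_nodes=1000):
    """Breadth-first crawl that follows at most `max_fanout` neighbors per node.

    `neighbors` is a hypothetical callback: node -> iterable of adjacent nodes.
    Returns the nodes in the order they were discovered.
    """
    visited = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this node
        # Per-node cap: look at only the first `max_fanout` neighbors.
        for nxt in islice(neighbors(node), max_fanout):
            if nxt in visited:
                continue
            if len(visited) >= max_nodes:
                return order  # global budget exhausted
            visited.add(nxt)
            order.append(nxt)
            queue.append((nxt, depth + 1))
    return order


# Toy graph (made up) to show the cap in action with max_fanout=2.
toy_graph = {"a": ["b", "c", "d"], "b": ["e"], "e": ["f"]}
result = capped_crawl("a", lambda n: toy_graph.get(n, []), max_fanout=2)
print(result)  # ['a', 'b', 'c', 'e']
```

Node `d` is skipped by the fanout cap, and `f` is never reached because `e` sits at the depth limit.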

The Insight (Rule of Thumb)

  • Action: Slice the references and citations lists to `[:max_ref_citations]` before iterating.
  • Value: 200 per direction (references and citations). Most papers have fewer than 200 references, so this cap primarily limits highly-cited papers.
  • Trade-off: For papers with more than 200 citations, only the first 200 (as returned by the API) are followed. The API returns citations in an unspecified order, so the "first 200" are not necessarily the most relevant or most highly cited ones.
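One possible way to soften the trade-off above is to rank entries before slicing. The sketch below is not what the script does; it assumes each entry carries a `citationCount` field, which depends on the fields requested from the API and is not confirmed by the source. The helper name `top_cited_ids` is hypothetical.

```python
def top_cited_ids(entries, limit=200):
    """Return up to `limit` paper IDs, most-cited first.

    Assumes each entry is a dict that may carry a `citationCount` key
    (an assumption; missing or null counts are treated as 0).
    """
    # Skip entries whose paperId is missing or empty.
    usable = [e for e in entries if e.get("paperId")]
    ranked = sorted(usable, key=lambda e: e.get("citationCount") or 0, reverse=True)
    return [e["paperId"] for e in ranked[:limit]]


# Made-up entries for illustration only.
sample = [
    {"paperId": "a", "citationCount": 5},
    {"paperId": "b", "citationCount": 50},
    {"paperId": None, "citationCount": 999},
    {"paperId": "c"},
]
print(top_cited_ids(sample, limit=2))  # ['b', 'a']
```

Ranking costs nothing extra in API calls if the count is already in the response, but it biases the crawl toward older, well-established papers.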

Reasoning

Without this cap, a single highly cited survey paper could trigger 2000+ recursive API calls for its citations alone. Combined with the depth-2 recursion limit, each seed paper could theoretically spawn up to 200 x 200 = 40,000 recursive calls at the second level in one direction; if references and citations are both followed, the per-paper fanout is up to 400 and the bound grows accordingly. The global max_papers=1000 limit provides the final safety valve, but the per-paper cap reduces unnecessary API calls for papers that are fetched after the global limit is reached.
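The worst-case arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope bound, not a measurement of the script: it assumes every paper hits the cap and no paper is discovered twice. The helper name `worst_case_papers` is hypothetical.

```python
def worst_case_papers(branching, depth):
    """Upper bound on papers reachable from one seed.

    Sums branching**k over levels k = 1..depth, i.e. assumes every
    paper has exactly `branching` unseen neighbors.
    """
    return sum(branching ** k for k in range(1, depth + 1))


one_direction = worst_case_papers(200, 2)    # 200 + 200*200
both_directions = worst_case_papers(400, 2)  # 400 + 400*400, refs and citations

print(one_direction)    # 40200
print(both_directions)  # 160400
```

Either bound is far above a global limit of 1000 papers, which is why that limit, rather than the per-paper cap, ends up deciding the final corpus size in the worst case.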

The value of 200 was chosen to be:

  1. High enough to capture all references for most papers (a typical paper has 30-50 references).
  2. Low enough to prevent any single paper from dominating the crawl budget.

Code evidence from `scripts/deep_collection_sementic.py:16`:

```python
max_ref_citations = 200  # Limit references & citations per paper
```

Code evidence from `scripts/deep_collection_sementic.py:76-77`:

```python
ref_ids = [ref["paperId"] for ref in paper.get("references", []) if "paperId" in ref][:max_ref_citations]
cite_ids = [cite["paperId"] for cite in paper.get("citations", []) if "paperId" in cite][:max_ref_citations]
```
