Heuristic:Mbzuai oryx Awesome LLM Post training Depth Limit Recursion At 2
| Knowledge Sources | |
|---|---|
| Domains | Data_Collection, Optimization |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Limit citation graph traversal to a recursion depth of 2 so the paper corpus stays at a manageable size while remaining relevant.
Description
The deep paper collection script recursively follows references and citations from seed papers. Without a depth limit, the citation graph would expand exponentially, eventually encompassing millions of papers. The script limits recursion to depth 2: seed papers (depth 1) have their references and citations fetched (depth 2), but those second-level papers do not trigger further fetching. This creates a focused corpus of directly and indirectly related papers.
Usage
Apply this heuristic when traversing citation graphs or any recursive graph structure. The depth limit of 2 strikes a balance between corpus breadth and API/time costs. Increase to 3 for broader coverage (at exponentially higher cost) or decrease to 1 for a minimal seed-only corpus.
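The cost of raising the depth can be estimated with a short calculation. The sketch below is illustrative only: `corpus_upper_bound` and the fan-out of 50 are hypothetical, and the result is an upper bound for a single seed because it ignores duplicate papers and any per-paper or global caps.

```python
def corpus_upper_bound(fanout: int, max_depth: int) -> int:
    """Upper bound on corpus size for one seed, ignoring duplicate papers.

    Depth 1 is the seed alone; every additional level multiplies the newest
    layer by `fanout` (references plus citations followed per paper).
    """
    if max_depth < 1:
        return 0
    layer, total = 1, 0
    for _ in range(max_depth):
        total += layer   # count the papers on the current level
        layer *= fanout  # each of them contributes `fanout` new papers
    return total


# Assumed fan-out of 50 papers per expanded paper (illustrative value only).
for depth in (1, 2, 3):
    print(depth, corpus_upper_bound(fanout=50, max_depth=depth))
```

Under these assumptions the bound grows from 1 paper at depth 1 to 51 at depth 2 and 2,551 at depth 3. Real corpora grow more slowly because neighboring papers share many references.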
The Insight (Rule of Thumb)
- Action: Add a `depth` parameter to recursive fetching functions and stop recursing when `depth > 2`.
- Value: Depth 2 captures seed papers plus their immediate citations and references. With a typical paper having 30-50 references and 10-100 citations, each seed contributes roughly 40-150 neighbors, so a modest set of seeds yields hundreds to thousands of papers.
- Trade-off: Depth 2 misses papers that are 3+ hops away from the seed. However, the most relevant papers in a research area are typically within 2 hops of a survey paper. Going to depth 3 would expand the corpus by 10-100x.
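The rule can be sketched as a recursive function with a `depth` parameter. This is a minimal illustration, not the script's actual code: `TOY_GRAPH` and `collect` are hypothetical names, and the graph is an in-memory stand-in for API lookups. The function records the shallowest depth at which each paper was reached, so a paper first seen on a long path is still expanded if a shorter path reaches it later.

```python
from typing import Dict, List

# Hypothetical toy citation graph: paper ID -> neighboring paper IDs
# (references and citations merged). A real script would query an API.
TOY_GRAPH: Dict[str, List[str]] = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def collect(paper_id: str, seen: Dict[str, int], depth: int = 1,
            max_depth: int = 2) -> None:
    """Record paper_id and its neighbors in `seen`, down to max_depth levels."""
    if depth > max_depth:
        return  # depth limit reached: do not fetch this paper
    if seen.get(paper_id, max_depth + 1) <= depth:
        return  # already reached at this depth or a shallower one
    seen[paper_id] = depth
    for neighbor in TOY_GRAPH.get(paper_id, []):
        collect(neighbor, seen, depth + 1, max_depth)


corpus_depths: Dict[str, int] = {}
collect("seed", corpus_depths)
print(sorted(corpus_depths))
```

With the default `max_depth=2` the toy corpus holds the seed and its two direct neighbors. Passing `max_depth=3` also pulls in `c` and `d`, which are two hops from the seed.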
Reasoning
Citation counts follow a heavy-tailed, approximately power-law distribution: a few papers are highly cited, and most papers have few citations. At depth 1 (seed only), the corpus is too narrow. At depth 2, the corpus captures the "neighborhood" of the seed papers, including seminal works and recent follow-ups. At depth 3+, the corpus rapidly includes tangentially related papers from adjacent research fields, diluting relevance.
Combined with the `max_papers=1000` global limit and the `max_ref_citations=200` per-paper cap, the depth 2 limit creates a three-layer defense against uncontrolled growth.
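How the three limits interact can be sketched as follows. This is an assumption-laden illustration rather than the script's implementation: it uses an iterative breadth-first queue instead of recursion, `PAPERS` and `crawl` are hypothetical, and entries whose `paperId` is missing or `None` are skipped. The parameter names and default values follow the limits named above.

```python
from collections import deque
from typing import Dict, List

# Hypothetical in-memory stand-in for API responses (not real data).
PAPERS: Dict[str, dict] = {
    "seed": {
        "references": [{"paperId": "r1"}, {"paperId": "r2"}, {"paperId": "r3"}],
        "citations": [{"paperId": "c1"}, {"paperId": None}],
    },
    "r1": {"references": [{"paperId": "x1"}], "citations": []},
}


def crawl(seed_ids: List[str], max_depth: int = 2, max_papers: int = 1000,
          max_ref_citations: int = 200) -> List[str]:
    """Collect paper IDs breadth-first under three independent limits."""
    seeds = list(dict.fromkeys(seed_ids))  # drop duplicate seeds, keep order
    corpus: List[str] = []
    seen = set(seeds)
    queue = deque((paper_id, 1) for paper_id in seeds)  # seeds sit at depth 1

    while queue and len(corpus) < max_papers:  # limit 1: global corpus size
        paper_id, depth = queue.popleft()
        corpus.append(paper_id)
        if depth >= max_depth:  # limit 2: keep the paper, do not expand it
            continue
        paper = PAPERS.get(paper_id, {})
        # limit 3: per-paper cap, applied to references and citations separately
        ref_ids = [r["paperId"] for r in paper.get("references", [])
                   if r.get("paperId")][:max_ref_citations]
        cite_ids = [c["paperId"] for c in paper.get("citations", [])
                    if c.get("paperId")][:max_ref_citations]
        for next_id in ref_ids + cite_ids:
            if next_id not in seen:
                seen.add(next_id)
                queue.append((next_id, depth + 1))
    return corpus


print(crawl(["seed"]))
```

Each limit stops growth on its own: lowering `max_papers` truncates the corpus, lowering `max_ref_citations` trims each paper's neighbor lists, and `max_depth` decides which papers are expanded at all.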
Code evidence from `scripts/deep_collection_sementic.py:75`:

    # Limit depth to avoid long chains
    if depth <= 2:  # Allow fetching references & citations up to depth 2
        ref_ids = [ref["paperId"] for ref in paper.get("references", []) if "paperId" in ref][:max_ref_citations]
        cite_ids = [cite["paperId"] for cite in paper.get("citations", []) if "paperId" in cite][:max_ref_citations]
Code evidence from `scripts/deep_collection_sementic.py:37`:

    def fetch_paper_details(paper_id, depth=1):
        """Fetches paper details with a limit on recursion depth"""