# Principle: Recursive Paper Fetching (MBZUAI Oryx, Awesome LLM Post-Training)
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Collection, Graph_Traversal, Bibliometrics |
| Last Updated | 2026-02-08 07:30 GMT |
## Overview
A depth-limited recursive graph traversal strategy that expands an academic paper corpus by following reference and citation links from known papers.
## Description
Recursive Paper Fetching implements a bounded breadth-first expansion of the academic citation graph. Starting from seed papers, it fetches each paper's full metadata along with its reference and citation lists, then recursively fetches those linked papers up to a configurable depth limit. The algorithm incorporates three critical safeguards: depth limiting (to prevent infinite recursion), deduplication (to avoid re-fetching papers already in the corpus), and a global paper count cap (to bound total collection size).
This technique addresses the fundamental challenge of building comprehensive domain-specific corpora: a keyword search alone misses papers that are relevant but use different terminology. By traversing the citation graph, the pipeline discovers papers that are structurally connected to the seed set regardless of keyword overlap.
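The three safeguards can be combined in a small self-contained sketch. This is an illustrative bounded breadth-first expansion over an in-memory toy graph, not the actual pipeline: the `links` dict stands in for a real paper API, and all names and limit values (`crawl`, `MAX_DEPTH`, `MAX_PAPERS`, `MAX_PER_PAPER`) are hypothetical:

```python
from collections import deque

# Hypothetical limits; real values depend on API rate limits and corpus goals.
MAX_DEPTH = 2      # depth limit: stop expanding beyond this distance from seeds
MAX_PAPERS = 4     # global cap: bound total corpus size
MAX_PER_PAPER = 2  # per-paper cap on followed reference/citation links

def crawl(seeds, links):
    """Bounded breadth-first expansion over a citation graph.

    `links` maps a paper ID to its outgoing linked-paper IDs; in a real
    pipeline this lookup would be an API call returning references/citations.
    """
    visited = set()                       # deduplication across branches
    queue = deque((p, 0) for p in seeds)  # (paper_id, depth) frontier
    while queue and len(visited) < MAX_PAPERS:
        paper_id, depth = queue.popleft()
        if depth > MAX_DEPTH or paper_id in visited:
            continue
        visited.add(paper_id)             # "fetch" each paper exactly once
        for linked in links.get(paper_id, [])[:MAX_PER_PAPER]:
            queue.append((linked, depth + 1))
    return visited

# Toy citation graph: A links to B and C; B links to D, E, F; C links back to A.
graph = {"A": ["B", "C"], "B": ["D", "E", "F"], "C": ["A"]}
corpus = crawl(["A"], graph)  # the cycle C -> A does not cause re-fetching
```

Note how each safeguard fires on this toy input: deduplication absorbs the `C -> A` cycle, the per-paper cap drops `F` (B's third link), and the global cap stops the crawl after four papers.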
## Usage
Use this principle when building a research corpus that needs to capture the full citation neighborhood around a set of seed papers. It is appropriate when:
- Simple keyword search is insufficient for complete domain coverage
- The citation graph structure is informative for understanding the field
- Collection size must be bounded despite the exponential growth of citation links
- Deduplication across recursive branches is necessary
## Theoretical Basis
The algorithm performs a depth-limited graph traversal over the academic citation graph. The set of papers reached by a call `fetch(p, d)` can be written as:

```
fetch(p, d) = ∅                                                    if d > Dmax or p ∈ visited
fetch(p, d) = {p} ∪ ⋃ { fetch(q, d+1) : q ∈ refs(p) ∪ cites(p) }   otherwise
```

Where:
- p is a paper ID
- d is the current recursion depth
- Dmax is the maximum allowed depth
- visited is the global deduplication set
- refs(p) and cites(p) are p's reference and citation lists
Pseudo-code Logic:

```python
# Abstract recursive fetching algorithm (NOT a real implementation)
visited = set()  # global deduplication set
count = 0        # global paper counter

def fetch(paper_id, depth):
    global count
    # All three safeguards are checked before any network call is made.
    if depth > MAX_DEPTH or paper_id in visited or count >= MAX_PAPERS:
        return None
    metadata = api.get_paper(paper_id)
    visited.add(paper_id)
    count += 1
    # Recurse into both directions of the citation graph, capped per paper.
    for ref_id in metadata.references[:MAX_PER_PAPER]:
        fetch(ref_id, depth + 1)
    for cite_id in metadata.citations[:MAX_PER_PAPER]:
        fetch(cite_id, depth + 1)
    return metadata
```
The exponential branching factor is controlled by three bounds: the depth limit (`MAX_DEPTH`), the per-paper reference/citation cap (`MAX_PER_PAPER`), and the global paper count cap (`MAX_PAPERS`).
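To see why all three bounds are needed, note that each fetched paper can enqueue up to 2 · MAX_PER_PAPER links (references plus citations), so before the global cap kicks in, a single seed can trigger a geometric series of fetches across depths. A brief sketch of that worst-case arithmetic (the function name and example values are illustrative, not from the pipeline):

```python
def worst_case_fetches(max_depth, max_per_paper):
    # Each fetched paper follows up to 2 * max_per_paper links
    # (references + citations), giving a geometric series in depth:
    # sum over d = 0..max_depth of (2 * max_per_paper) ** d
    branching = 2 * max_per_paper
    return sum(branching ** d for d in range(max_depth + 1))

# With depth 2 and 20 links per paper: 1 + 40 + 1600 = 1641 fetches per seed.
print(worst_case_fetches(2, 20))  # → 1641
```

Even modest settings yield thousands of potential fetches per seed, which is why the global `MAX_PAPERS` cap is required on top of the depth and per-paper limits.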