Principle:Mbzuai oryx Awesome LLM Post training Recursive Paper Fetching

From Leeroopedia


Knowledge Sources
Domains: Data_Collection, Graph_Traversal, Bibliometrics
Last Updated: 2026-02-08 07:30 GMT

Overview

A depth-limited recursive graph traversal strategy that expands an academic paper corpus by following reference and citation links from known papers.

Description

Recursive Paper Fetching implements a bounded recursive expansion of the academic citation graph. Starting from seed papers, it fetches each paper's full metadata along with its reference and citation lists, then recursively fetches those linked papers up to a configurable depth limit. Because each paper's links are followed as soon as the paper is fetched, papers are visited in depth-first order. The algorithm incorporates three critical safeguards: depth limiting (to bound the recursion), deduplication (to avoid re-fetching papers already in the corpus, which also stops cycles in the graph from looping forever), and a global paper count cap (to bound total collection size).

This technique addresses the fundamental challenge of building comprehensive domain-specific corpora: a keyword search alone misses papers that are relevant but use different terminology. By traversing the citation graph, the pipeline discovers papers that are structurally connected to the seed set regardless of keyword overlap.

Usage

Use this principle when building a research corpus that needs to capture the citation neighborhood around a set of seed papers, within fixed size limits. It is appropriate when:

  • Simple keyword search is insufficient for complete domain coverage
  • The citation graph structure is informative for understanding the field
  • Collection size must be bounded despite the exponential growth of citation links
  • Deduplication across recursive branches is necessary

Theoretical Basis

The algorithm performs a depth-limited graph traversal over the academic citation graph:

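$$
\text{fetch}(p, d) =
\begin{cases}
\text{metadata}(p) \,\cup \bigcup_{r \in \text{refs}(p)} \text{fetch}(r, d+1) \,\cup \bigcup_{c \in \text{cites}(p)} \text{fetch}(c, d+1) & \text{if } d \le D_{\max} \text{ and } p \notin \text{visited} \\
\emptyset & \text{otherwise}
\end{cases}
$$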

Where:

  • p is a paper ID
  • d is the current recursion depth
  • D_max is the maximum allowed depth
  • visited is a global deduplication set

Pseudo-code Logic:

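# Abstract recursive fetching algorithm (NOT real implementation)
# `api`, MAX_DEPTH, MAX_PER_PAPER and MAX_PAPERS are placeholders defined elsewhere.
visited = set()  # global deduplication set; its size is the running paper count

def fetch(paper_id, depth):
    # Stop on any bound: depth limit, already-fetched paper, or global cap.
    if depth > MAX_DEPTH or paper_id in visited or len(visited) >= MAX_PAPERS:
        return None
    metadata = api.get_paper(paper_id)
    visited.add(paper_id)
    # Cap the fan-out per paper before recursing into references, then citations.
    for ref_id in metadata.references[:MAX_PER_PAPER]:
        fetch(ref_id, depth + 1)
    for cite_id in metadata.citations[:MAX_PER_PAPER]:
        fetch(cite_id, depth + 1)
    return metadata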
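
To make the pseudo-code concrete, the following is a minimal runnable sketch that swaps the paper API for a small in-memory graph. The `Paper` dataclass, the `TOY_GRAPH` data, the `get_paper` and `crawl` helpers, and the bound values are all hypothetical, chosen only for illustration; this is not the pipeline's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical bounds, chosen only for this illustration.
MAX_DEPTH = 2
MAX_PER_PAPER = 2
MAX_PAPERS = 5


@dataclass
class Paper:
    paper_id: str
    references: list = field(default_factory=list)
    citations: list = field(default_factory=list)


# Hypothetical toy citation graph standing in for a real paper API.
TOY_GRAPH = {
    "seed": Paper("seed", references=["A", "B", "C"], citations=["D"]),
    "A": Paper("A", references=["E"], citations=["seed"]),
    "B": Paper("B", references=["E"], citations=["seed"]),
    "C": Paper("C", citations=["seed"]),
    "D": Paper("D", references=["seed"]),
    "E": Paper("E", citations=["A", "B"]),
}


def get_paper(paper_id):
    """Stand-in for an API call; returns None for unknown IDs."""
    return TOY_GRAPH.get(paper_id)


def crawl(seed_ids):
    """Collect papers reachable from the seeds under all three bounds."""
    visited = {}  # paper_id -> Paper; doubles as dedup set and counter

    def fetch(paper_id, depth):
        # Same stopping rule as the pseudo-code: depth, dedup, global cap.
        if depth > MAX_DEPTH or paper_id in visited or len(visited) >= MAX_PAPERS:
            return
        paper = get_paper(paper_id)
        if paper is None:
            return
        visited[paper_id] = paper
        for ref_id in paper.references[:MAX_PER_PAPER]:
            fetch(ref_id, depth + 1)
        for cite_id in paper.citations[:MAX_PER_PAPER]:
            fetch(cite_id, depth + 1)

    for seed_id in seed_ids:
        fetch(seed_id, 0)
    return visited


corpus = crawl(["seed"])
print(sorted(corpus))  # ['A', 'B', 'D', 'E', 'seed']
```

With these values, C is never fetched because it sits third in the seed's reference list and `MAX_PER_PAPER` is 2, and E is fetched only once even though both A and B reference it.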

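The exponential growth of the traversal is controlled by three bounds: the depth limit, the per-paper reference/citation cap, and the total paper count. Deduplication does not bound growth by itself, but it ensures that no paper counts against these limits more than once.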
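
As a rough illustration of how these bounds interact, the worst case can be computed directly. The sketch assumes, as in the pseudo-code, that each fetched paper expands at most `MAX_PER_PAPER` references plus `MAX_PER_PAPER` citations, and additionally assumes that seeds start at depth 0 (the starting depth is not stated in the source). Under those assumptions the number of fetched papers is at most a geometric sum over the depth levels, clipped by the global cap. The function name `worst_case_papers` is hypothetical.

```python
def worst_case_papers(num_seeds, max_depth, max_per_paper, max_papers):
    """Upper bound on fetched papers, ignoring any savings from deduplication."""
    branching = 2 * max_per_paper  # references plus citations per paper
    # Depth levels 0..max_depth: num_seeds * (1 + b + b^2 + ... + b^max_depth)
    unbounded = num_seeds * sum(branching ** level for level in range(max_depth + 1))
    return min(unbounded, max_papers)


small = worst_case_papers(num_seeds=1, max_depth=2, max_per_paper=2, max_papers=5)
large = worst_case_papers(num_seeds=1, max_depth=3, max_per_paper=10, max_papers=100_000)
print(small)  # 5    (1 + 4 + 16 = 21, clipped by the global cap)
print(large)  # 8421 (1 + 20 + 400 + 8000)
```

The second call shows why the depth limit and per-paper cap alone may not be enough: one extra level multiplies the bound by the branching factor, so the global cap is what keeps the collection size predictable.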
