# Principle: Recursive Paper Fetching (MBZUAI Oryx, Awesome LLM Post-Training)
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Collection, Graph_Traversal, Bibliometrics |
| Last Updated | 2026-02-08 07:30 GMT |
## Overview
A depth-limited recursive graph traversal strategy that expands an academic paper corpus by following reference and citation links from known papers.
## Description
Recursive Paper Fetching implements a bounded breadth-first expansion of the academic citation graph. Starting from seed papers, it fetches each paper's full metadata along with its reference and citation lists, then recursively fetches those linked papers up to a configurable depth limit. The algorithm incorporates three critical safeguards: depth limiting (to prevent infinite recursion), deduplication (to avoid re-fetching papers already in the corpus), and a global paper count cap (to bound total collection size).
This technique addresses the fundamental challenge of building comprehensive domain-specific corpora: a keyword search alone misses papers that are relevant but use different terminology. By traversing the citation graph, the pipeline discovers papers that are structurally connected to the seed set regardless of keyword overlap.
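The three safeguards can be combined in a small self-contained sketch. This is an illustrative bounded breadth-first expansion over an in-memory toy graph, not the actual pipeline: the `links` dict stands in for a real paper API, and all names and limit values (`crawl`, `MAX_DEPTH`, `MAX_PAPERS`, `MAX_PER_PAPER`) are hypothetical:

```python
from collections import deque

# Hypothetical limits; real values depend on API rate limits and corpus goals.
MAX_DEPTH = 2      # depth limit: stop expanding beyond this distance from seeds
MAX_PAPERS = 4     # global cap: bound total corpus size
MAX_PER_PAPER = 2  # per-paper cap on followed reference/citation links

def crawl(seeds, links):
    """Bounded breadth-first expansion over a citation graph.

    `links` maps a paper ID to its outgoing linked-paper IDs; in a real
    pipeline this lookup would be an API call returning references/citations.
    """
    visited = set()                       # deduplication across branches
    queue = deque((p, 0) for p in seeds)  # (paper_id, depth) frontier
    while queue and len(visited) < MAX_PAPERS:
        paper_id, depth = queue.popleft()
        if depth > MAX_DEPTH or paper_id in visited:
            continue
        visited.add(paper_id)             # "fetch" each paper exactly once
        for linked in links.get(paper_id, [])[:MAX_PER_PAPER]:
            queue.append((linked, depth + 1))
    return visited

# Toy citation graph: A links to B and C; B links to D, E, F; C links back to A.
graph = {"A": ["B", "C"], "B": ["D", "E", "F"], "C": ["A"]}
corpus = crawl(["A"], graph)  # the cycle C -> A does not cause re-fetching
```

Note how each safeguard fires on this toy input: deduplication absorbs the `C -> A` cycle, the per-paper cap drops `F` (B's third link), and the global cap stops the crawl after four papers.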
## Usage
Use this principle when building a research corpus that needs to capture the full citation neighborhood around a set of seed papers. It is appropriate when:
- Simple keyword search is insufficient for complete domain coverage
- The citation graph structure is informative for understanding the field
- Collection size must be bounded despite the exponential growth of citation links
- Deduplication across recursive branches is necessary
## Theoretical Basis
The algorithm performs a depth-limited graph traversal over the academic citation graph. The set of papers reached by a call `fetch(p, d)` can be written as:

```
fetch(p, d) = ∅                                                    if d > Dmax or p ∈ visited
fetch(p, d) = {p} ∪ ⋃ { fetch(q, d+1) : q ∈ refs(p) ∪ cites(p) }   otherwise
```

Where:
- p is a paper ID
- d is the current recursion depth
- Dmax is the maximum allowed depth
- visited is the global deduplication set
- refs(p) and cites(p) are p's reference and citation lists
Pseudo-code Logic:

```python
# Abstract recursive fetching algorithm (NOT a real implementation)
visited = set()  # global deduplication set
count = 0        # global paper counter

def fetch(paper_id, depth):
    global count
    # All three safeguards are checked before any network call is made.
    if depth > MAX_DEPTH or paper_id in visited or count >= MAX_PAPERS:
        return None
    metadata = api.get_paper(paper_id)
    visited.add(paper_id)
    count += 1
    # Recurse into both directions of the citation graph, capped per paper.
    for ref_id in metadata.references[:MAX_PER_PAPER]:
        fetch(ref_id, depth + 1)
    for cite_id in metadata.citations[:MAX_PER_PAPER]:
        fetch(cite_id, depth + 1)
    return metadata
```
The exponential branching factor is controlled by three bounds: the depth limit (`MAX_DEPTH`), the per-paper reference/citation cap (`MAX_PER_PAPER`), and the global paper count cap (`MAX_PAPERS`).
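To see why all three bounds are needed, note that each fetched paper can enqueue up to 2 · MAX_PER_PAPER links (references plus citations), so before the global cap kicks in, a single seed can trigger a geometric series of fetches across depths. A brief sketch of that worst-case arithmetic (the function name and example values are illustrative, not from the pipeline):

```python
def worst_case_fetches(max_depth, max_per_paper):
    # Each fetched paper follows up to 2 * max_per_paper links
    # (references + citations), giving a geometric series in depth:
    # sum over d = 0..max_depth of (2 * max_per_paper) ** d
    branching = 2 * max_per_paper
    return sum(branching ** d for d in range(max_depth + 1))

# With depth 2 and 20 links per paper: 1 + 40 + 1600 = 1641 fetches per seed.
print(worst_case_fetches(2, 20))  # → 1641
```

Even modest settings yield thousands of potential fetches per seed, which is why the global `MAX_PAPERS` cap is required on top of the depth and per-paper limits.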