Principle:Mbzuai oryx Awesome LLM Post training Seed Paper Search

From Leeroopedia


Knowledge Sources
Domains Data_Collection, Information_Retrieval
Last Updated 2026-02-08 07:30 GMT

Overview

A search strategy that retrieves an initial set of highly relevant academic papers to serve as entry points for broader corpus construction.

Description

Seed Paper Search is the first active step in a snowball-style academic data collection pipeline. It issues a keyword query against an academic search API and retrieves a small set of top-ranked papers. These seed papers are not the final dataset; instead, they serve as starting nodes for recursive reference and citation crawling. The quality of the seed set directly determines the relevance and coverage of the final corpus.

This approach addresses the cold-start problem in corpus building: without a well-chosen seed set, recursive crawling may drift into irrelevant subfields or miss core papers entirely.

Usage

Use this principle when building a domain-specific academic corpus through programmatic API access. It is the appropriate starting step when:

  • The collection strategy uses snowball sampling (following references and citations)
  • A targeted keyword query can identify a small number of highly relevant papers
  • The API supports field-specific search with metadata retrieval (title, abstract, references, citations)
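As a rough illustration of the last requirement, a field-specific search request might be assembled as shown here. The base URL and the parameter names `query`, `limit`, and `fields` are assumptions made for this sketch and do not describe any particular provider's API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the search URL of the API actually used.
BASE_URL = "https://api.example.org/paper/search"

def build_seed_search_url(query, limit=5,
                          fields=("title", "abstract", "references", "citations")):
    """Build a search URL that requests only the metadata needed for seeding."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Parameter names are assumed; many search APIs take a comma-separated field list.
    params = {"query": query, "limit": limit, "fields": ",".join(fields)}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_seed_search_url("LLM post-training", limit=3)
print(url)
```

Requesting references and citations alongside title and abstract means the seed response already carries the edges needed for the first crawl step, which avoids a second round of per-paper lookups.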

Theoretical Basis

Seed search follows the snowball sampling (citation chaining) methodology used in bibliometrics and systematic literature reviews:

  1. Define a precise query representing the target research domain
  2. Retrieve a small, high-quality seed set (typically 1-10 papers)
  3. Use seed papers as starting nodes for graph traversal via references and citations

Pseudo-code Logic:

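# Abstract seed search algorithm (NOT a real implementation).
# `academic_api` and `recursive_crawl` are placeholders for a search client and a crawler.
small_number = 5  # seed set size, typically 1-10
seed_papers = academic_api.search(
    query="target domain keywords",
    limit=small_number,
    fields=["title", "abstract", "references", "citations"],
)
for paper in seed_papers:
    # Each seed is a root of the crawl, so traversal starts at depth 0
    recursive_crawl(paper.id, depth=0)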
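A minimal runnable sketch of the same flow, under stated assumptions: the `PAPERS` dictionary is a made-up in-memory corpus standing in for an academic API, `search` uses a toy title-matching relevance score, and the crawl is written as an iterative breadth-first traversal with a depth cap rather than the recursive call in the pseudo-code. None of these names come from a real library.

```python
from collections import deque

# Toy in-memory corpus standing in for an academic API (all data is made up).
PAPERS = {
    "A": {"title": "Post-training of language models", "references": ["B"], "citations": ["C"]},
    "B": {"title": "Reward modeling basics", "references": [], "citations": ["A"]},
    "C": {"title": "Post-training survey", "references": ["A", "D"], "citations": []},
    "D": {"title": "Unrelated graph theory", "references": [], "citations": ["C"]},
}

def search(query, limit):
    """Rank papers by how many query terms appear in the title (toy relevance)."""
    terms = query.lower().split()
    scored = []
    for pid, paper in PAPERS.items():
        title = paper["title"].lower()
        score = sum(term in title for term in terms)
        if score > 0:
            scored.append((score, pid))
    # Highest score first; ties are broken by id so the result is deterministic.
    scored.sort(key=lambda sp: (-sp[0], sp[1]))
    return [pid for _, pid in scored[:limit]]

def crawl(seed_ids, max_depth):
    """Breadth-first traversal over references and citations, bounded by depth."""
    seen = {pid: 0 for pid in seed_ids}  # paper id -> depth at which it was found
    queue = deque(seed_ids)
    while queue:
        pid = queue.popleft()
        depth = seen[pid]
        if depth >= max_depth:
            continue  # do not expand papers at the depth limit
        for neighbor in PAPERS[pid]["references"] + PAPERS[pid]["citations"]:
            if neighbor not in seen:
                seen[neighbor] = depth + 1
                queue.append(neighbor)
    return seen

seeds = search("post-training", limit=1)
corpus = crawl(seeds, max_depth=1)
print(seeds, corpus)
```

In this toy graph, raising `max_depth` from 1 to 2 pulls in the off-topic paper `D`, which illustrates how crawl depth and seed choice together control drift.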

The critical design parameter is the specificity of the seed query: a query that is too broad yields noisy seeds, while one that is too narrow misses important subfields.
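One crude way to spot an overly broad query is to check how many returned seeds actually mention the core terms of the target domain. The helper `seed_precision` and the sample records are hypothetical and serve only as an illustration; this is a rough proxy for noise, not an established metric, and it does not detect queries that are too narrow.

```python
def seed_precision(seeds, core_terms):
    """Fraction of seeds whose title or abstract mentions every core term."""
    if not seeds:
        return 0.0

    def on_topic(paper):
        # Missing or null fields are treated as empty text.
        text = ((paper.get("title") or "") + " " + (paper.get("abstract") or "")).lower()
        return all(term.lower() in text for term in core_terms)

    return sum(on_topic(p) for p in seeds) / len(seeds)

# Made-up search results for a broad and a narrow query.
broad_results = [
    {"title": "Language models", "abstract": "A general overview."},
    {"title": "Post-training LLMs", "abstract": "Alignment after pretraining of language models."},
]
narrow_results = [
    {"title": "Post-training LLMs", "abstract": "Alignment after pretraining of language models."},
]

broad_score = seed_precision(broad_results, ["post-training"])
narrow_score = seed_precision(narrow_results, ["post-training"])
print(broad_score, narrow_score)
```

A low score suggests tightening the query before crawling. A high score with very few results may instead point to a query that is too narrow, which is better checked by comparing the crawled corpus against known core papers.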

Related Pages

Implemented By
