Principle:Marker Inc Korea AutoRAG Corpus Sampling

Knowledge Sources	Data Augmentation for Low-Resource QA; AutoRAG Documentation
Domains	Information Retrieval, Evaluation Methodology, Data Sampling
Last Updated	2026-02-12 00:00 GMT

Overview

Corpus sampling selects a subset of passages from a chunked corpus to serve as ground-truth evidence for question-answer pair generation, establishing the retrieval targets that define the evaluation dataset.

Description

After a document corpus has been chunked into individual passages, not every passage may be suitable or necessary for QA generation. Corpus sampling selects which passages will be used as the basis for generating evaluation questions and answers. Each sampled passage becomes a retrieval ground truth entry: when the RAG system is later evaluated, it is expected to retrieve these passages (or overlapping ones) in response to the generated questions.

Random single-hop sampling is the simplest and most commonly used strategy. It draws a uniformly random subset of n passages from the corpus, assigning each a unique question ID and wrapping the passage's document ID as the retrieval ground truth. The term "single-hop" indicates that each question is grounded in exactly one passage. This contrasts with multi-hop sampling, where two or more passages are selected together to form the evidence set for a single complex question.
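The random strategy described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not AutoRAG's own implementation: the function name `random_single_hop_sketch` and the toy corpus are hypothetical, and only the `doc_id` column is assumed to exist.

```python
import uuid

import pandas as pd


def random_single_hop_sketch(corpus_df: pd.DataFrame, n: int, random_state: int = 42) -> pd.DataFrame:
    """Hypothetical helper: draw n passages uniformly at random, one passage per question."""
    # Uniform sampling without replacement; the seed makes the draw repeatable.
    sampled = corpus_df.sample(n=n, random_state=random_state)
    return pd.DataFrame(
        {
            # One fresh question ID per sampled passage.
            "qid": [str(uuid.uuid4()) for _ in range(len(sampled))],
            # Wrap each doc_id as [[doc_id]]: one retrieval set holding one passage.
            "retrieval_gt": [[[doc_id]] for doc_id in sampled["doc_id"]],
        }
    )


# Toy corpus used only for illustration.
corpus_df = pd.DataFrame(
    {
        "doc_id": [f"doc-{i}" for i in range(10)],
        "contents": [f"passage text {i}" for i in range(10)],
    }
)
qa_df = random_single_hop_sketch(corpus_df, n=3, random_state=0)
```

Calling the helper again with the same `random_state` selects the same passages, although the generated question IDs differ because they are fresh UUIDs.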

Range single-hop sampling provides deterministic control by selecting passages from a specified index range within the corpus DataFrame. This is useful for systematic coverage, reproducible experiments, or targeted sampling from specific document regions.
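A comparable sketch for the range strategy is shown below. It assumes the range is given as positional row indices into the corpus DataFrame; the function name `range_single_hop_sketch` and the toy data are hypothetical and do not reproduce AutoRAG's code.

```python
import uuid

import pandas as pd


def range_single_hop_sketch(corpus_df: pd.DataFrame, idx_range) -> pd.DataFrame:
    """Hypothetical helper: select passages by positional index instead of at random."""
    # Positional selection is deterministic: the same range always yields the same rows.
    sampled = corpus_df.iloc[list(idx_range)]
    return pd.DataFrame(
        {
            "qid": [str(uuid.uuid4()) for _ in range(len(sampled))],
            "retrieval_gt": [[[doc_id]] for doc_id in sampled["doc_id"]],
        }
    )


# Toy corpus used only for illustration.
corpus_df = pd.DataFrame({"doc_id": [f"doc-{i}" for i in range(10)]})
qa_df = range_single_hop_sketch(corpus_df, range(2, 5))
```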

The design of the sampling step directly affects the evaluation dataset's coverage and balance. Uniform random sampling tends to over-represent long documents, because they produce more chunks and every chunk is equally likely to be drawn. Range-based sampling lets you choose index ranges deliberately, for example one range per source document, which can be used to even out coverage.
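To see why long documents dominate a uniform draw, it can help to inspect how chunks are distributed across source files. The snippet below is a small sketch that assumes the `path` column identifies the source document of each chunk; the file names are made up.

```python
import pandas as pd

# Toy corpus: one long document with 8 chunks and one short document with 2 chunks.
corpus_df = pd.DataFrame(
    {
        "doc_id": [f"chunk-{i}" for i in range(10)],
        "path": ["long.pdf"] * 8 + ["short.pdf"] * 2,
    }
)

# Under uniform sampling, each document's expected share of the sample
# equals its share of the chunks.
chunk_share = corpus_df["path"].value_counts(normalize=True)
```

Here `long.pdf` holds 80% of the chunks, so on average it also supplies 80% of a uniformly sampled question set.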

Usage

Corpus sampling is used after chunking and before query generation. It is invoked via the Corpus class's sample() method, which accepts a sampling function and its parameters. The output is a QA instance containing question IDs and retrieval ground truths, ready for query generation via batch_apply().
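The calling pattern, a method that receives a sampling function plus its keyword parameters and returns a QA object, can be illustrated with toy stand-ins. `ToyCorpus`, `ToyQA`, and `take_head` below are hypothetical names written for this sketch; they are not AutoRAG's classes and make no claim about its internals.

```python
import uuid

import pandas as pd


class ToyQA:
    """Stand-in for a QA container: holds the QA table and a link back to its corpus."""

    def __init__(self, qa_df: pd.DataFrame, linked_corpus: "ToyCorpus"):
        self.data = qa_df
        self.linked_corpus = linked_corpus


class ToyCorpus:
    """Stand-in for a corpus container exposing a sample() method."""

    def __init__(self, corpus_df: pd.DataFrame):
        self.data = corpus_df

    def sample(self, fn, **kwargs) -> ToyQA:
        # Delegate the selection logic to the supplied function and wrap the result.
        return ToyQA(fn(self.data, **kwargs), self)


def take_head(corpus_df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Trivial sampling function for the demo: use the first n passages."""
    head = corpus_df.head(n)
    return pd.DataFrame(
        {
            "qid": [str(uuid.uuid4()) for _ in range(len(head))],
            "retrieval_gt": [[[doc_id]] for doc_id in head["doc_id"]],
        }
    )


corpus = ToyCorpus(pd.DataFrame({"doc_id": ["a", "b", "c", "d"]}))
qa = corpus.sample(take_head, n=2)
```

Passing the sampler as an argument keeps the container independent of any particular strategy, so random, range-based, or custom samplers can be swapped without changing the surrounding pipeline.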

Theoretical Basis

The sampling process can be formalized as:

INPUT:  Corpus DataFrame C with columns (doc_id, contents, path, start_end_idx, metadata)
        Sampling parameters: n (sample size) and random_state (seed) for random sampling;
                             idx_range (row index range) for range sampling
OUTPUT: QA DataFrame Q with columns (qid, retrieval_gt)

Random Single-Hop:
    S = RandomSample(C, n, seed=random_state)
    For each row s_i in S:
        qid_i = UUID()
        retrieval_gt_i = [[s_i.doc_id]]   # nested list: outer = alternative sets, inner = passage IDs
    Q = DataFrame(qid, retrieval_gt)

Range Single-Hop:
    S = C[idx_range]
    For each row s_i in S:
        qid_i = UUID()
        retrieval_gt_i = [[s_i.doc_id]]
    Q = DataFrame(qid, retrieval_gt)

The retrieval_gt column uses a nested list structure: the outer list represents alternative acceptable retrieval sets (any one of which is correct), and the inner list contains the document IDs within that set. For single-hop sampling, this simplifies to [[doc_id]], meaning there is one acceptable retrieval set containing one passage.
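A quick shape check makes the single-hop case concrete. The helpers below are hypothetical utilities written for this page, not part of AutoRAG; they only verify that a value has the two-level `[[doc_id]]` form and extract the passage ID from it.

```python
def is_single_hop_gt(retrieval_gt) -> bool:
    """Return True when the value has the form [[doc_id]] with a non-list doc_id."""
    return (
        isinstance(retrieval_gt, list)
        and len(retrieval_gt) == 1
        and isinstance(retrieval_gt[0], list)
        and len(retrieval_gt[0]) == 1
        and not isinstance(retrieval_gt[0][0], list)
    )


def single_hop_doc_id(retrieval_gt):
    """Extract the single passage ID, rejecting any other shape."""
    if not is_single_hop_gt(retrieval_gt):
        raise ValueError(f"not a single-hop retrieval_gt: {retrieval_gt!r}")
    return retrieval_gt[0][0]


doc_id = single_hop_doc_id([["doc-7"]])
```

A value with two passages in the inner list, or with a third level of nesting, fails the check and is rejected.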

Sampling considerations:

Consideration	Impact
Sample size (n)	Larger samples produce more QA pairs but increase LLM costs for query/answer generation
Random state	Fixed seed ensures reproducibility across runs
Single-hop vs. multi-hop	Single-hop is simpler and more reliable; multi-hop tests cross-passage reasoning
Document balance	Random sampling may favor longer documents; stratified sampling can address this
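The stratified alternative mentioned in the table could look like the sketch below, which caps the number of passages drawn from each source document. It assumes the `path` column identifies the source document; the function name `stratified_single_hop_sketch` and its `per_document` parameter are hypothetical and are not presented as part of AutoRAG.

```python
import uuid

import pandas as pd


def stratified_single_hop_sketch(
    corpus_df: pd.DataFrame, per_document: int, random_state: int = 42
) -> pd.DataFrame:
    """Hypothetical helper: sample up to per_document passages from each source document."""
    parts = [
        # Short documents contribute all their chunks if they have fewer than per_document.
        group.sample(n=min(per_document, len(group)), random_state=random_state)
        for _, group in corpus_df.groupby("path")
    ]
    sampled = pd.concat(parts)
    return pd.DataFrame(
        {
            "qid": [str(uuid.uuid4()) for _ in range(len(sampled))],
            "retrieval_gt": [[[doc_id]] for doc_id in sampled["doc_id"]],
        }
    )


# Toy corpus: 8 chunks from one file, 2 chunks from another.
corpus_df = pd.DataFrame(
    {
        "doc_id": [f"chunk-{i}" for i in range(10)],
        "path": ["long.pdf"] * 8 + ["short.pdf"] * 2,
    }
)
qa_df = stratified_single_hop_sketch(corpus_df, per_document=2)
```

With `per_document=2`, both files contribute two questions each, whereas a uniform draw of four passages would be expected to take most of them from `long.pdf`.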

Related Pages

Implemented By

Implementation:Marker_Inc_Korea_AutoRAG_Random_Single_Hop
