Principle:Marker Inc Korea AutoRAG Corpus Sampling
| Knowledge Sources | |
|---|---|
| Domains | Information Retrieval, Evaluation Methodology, Data Sampling |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Corpus sampling selects a subset of passages from a chunked corpus to serve as ground-truth evidence for question-answer pair generation, establishing the retrieval targets that define the evaluation dataset.
Description
After a document corpus has been chunked into individual passages, not every passage is suitable or necessary for QA generation. Corpus sampling selects the passages that will serve as the basis for generating evaluation questions and answers. Each sampled passage becomes a retrieval ground truth entry: when the RAG system is later evaluated, it is expected to retrieve these passages (or overlapping ones) in response to the generated questions.
Random single-hop sampling is the simplest and most commonly used strategy. It draws a uniformly random subset of n passages from the corpus, assigning each a unique question ID and wrapping the passage's document ID as the retrieval ground truth. The term "single-hop" indicates that each question is grounded in exactly one passage. This contrasts with multi-hop sampling, where two or more passages are selected together to form the evidence set for a single complex question.
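The random single-hop strategy described above can be sketched with the Python standard library alone. This is a minimal illustration, not the library's implementation: the corpus is modelled as a list of dicts rather than a DataFrame, and the function name and its default seed are assumptions made for the example.

```python
import random
import uuid


def random_single_hop(corpus, n, random_state=42):
    """Draw n passages uniformly at random, without replacement.

    `corpus` is a list of dicts with a "doc_id" key (a stand-in for the
    corpus DataFrame; this representation is an assumption of the sketch).
    """
    rng = random.Random(random_state)  # local RNG, so the seed stays scoped
    sampled = rng.sample(corpus, n)    # raises ValueError if n > len(corpus)
    return [
        # One question ID per passage; the doc_id is wrapped in a nested list.
        {"qid": str(uuid.uuid4()), "retrieval_gt": [[row["doc_id"]]]}
        for row in sampled
    ]


corpus = [{"doc_id": f"doc-{i}", "contents": f"passage {i}"} for i in range(10)]
qa_rows = random_single_hop(corpus, n=3, random_state=0)
repeat_rows = random_single_hop(corpus, n=3, random_state=0)
```

With the same seed, both calls select the same passages, while the question IDs differ because they are freshly generated UUIDs.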
Range single-hop sampling provides deterministic control by selecting passages from a specified index range within the corpus DataFrame. This is useful for systematic coverage, reproducible experiments, or targeted sampling from specific document regions.
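A range-based selection might look like the following sketch. It assumes the corpus is a plain list of dicts (standing in for the corpus DataFrame) and that the index range is any iterable of integer positions; the function name is chosen for illustration only.

```python
import uuid


def range_single_hop(corpus, idx_range):
    """Select passages by position; no randomness is involved.

    `idx_range` is any iterable of integer positions, e.g. range(2, 5).
    """
    selected = [corpus[i] for i in idx_range]
    return [
        {"qid": str(uuid.uuid4()), "retrieval_gt": [[row["doc_id"]]]}
        for row in selected
    ]


corpus = [{"doc_id": f"doc-{i}", "contents": f"passage {i}"} for i in range(10)]
qa_rows = range_single_hop(corpus, range(2, 5))
selected_ids = [row["retrieval_gt"][0][0] for row in qa_rows]
```

Because the positions are given explicitly, repeated runs always pick the same passages, in corpus order.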
The design of the sampling step has a direct impact on the evaluation dataset's coverage and balance. Random sampling may over-represent long documents, because they produce more chunks and each chunk is equally likely to be drawn. Range-based sampling can even out coverage when the index ranges are chosen to take a similar number of chunks from each source document.
Usage
Corpus sampling is used after chunking and before query generation. It is invoked via the Corpus class's sample() method, which accepts a sampling function and its parameters. The output is a QA instance containing question IDs and retrieval ground truths, ready for query generation via batch_apply().
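The calling pattern, a sample() method that receives a sampling function plus its parameters and returns a QA object, can be illustrated with toy stand-ins. `ToyCorpus`, `ToyQA`, and `pick_random` below are hypothetical names for this sketch and are not the library's classes; they only show how a strategy is passed in as a function.

```python
import random
import uuid
from dataclasses import dataclass


@dataclass
class ToyQA:
    """Hypothetical stand-in for the QA instance: holds the sampled rows."""
    rows: list


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for the Corpus class: holds the chunked passages."""
    rows: list

    def sample(self, fn, **kwargs):
        # Delegate to the sampling function, forwarding its parameters.
        return ToyQA(rows=fn(self.rows, **kwargs))


def pick_random(rows, n, random_state=42):
    """Sampling function: n random passages, one question ID each."""
    rng = random.Random(random_state)
    return [
        {"qid": str(uuid.uuid4()), "retrieval_gt": [[row["doc_id"]]]}
        for row in rng.sample(rows, n)
    ]


corpus = ToyCorpus(rows=[{"doc_id": f"doc-{i}"} for i in range(6)])
qa = corpus.sample(pick_random, n=2, random_state=1)
```

Keeping the strategy as a plain function means a different sampler can be swapped in without changing the class that holds the corpus.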
Theoretical Basis
The sampling process can be formalized as:
INPUT:  Corpus DataFrame C with columns (doc_id, contents, path, start_end_idx, metadata)
        Sampling parameters: n (sample size), random_state (seed)
OUTPUT: QA DataFrame Q with columns (qid, retrieval_gt)

Random Single-Hop:
    S = RandomSample(C, n, seed=random_state)
    For each row s_i in S:
        qid_i = UUID()
        retrieval_gt_i = [[s_i.doc_id]]  # nested list: outer = alternative sets, inner = passage IDs
    Q = DataFrame(qid, retrieval_gt)

Range Single-Hop:
    S = C[idx_range]
    For each row s_i in S:
        qid_i = UUID()
        retrieval_gt_i = [[s_i.doc_id]]
    Q = DataFrame(qid, retrieval_gt)
The retrieval_gt column uses a nested list structure: the outer list represents alternative acceptable retrieval sets (any one of which is correct), and the inner list contains the document IDs within that set. For single-hop sampling, this simplifies to [[doc_id]], meaning there is one acceptable retrieval set containing one passage.
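The single-hop shape can be checked mechanically. The helper below is a hypothetical validation sketch that mirrors the pseudocode line `retrieval_gt_i = [[s_i.doc_id]]`: exactly two levels of nesting, one set, one passage ID. It assumes document IDs are strings.

```python
def is_single_hop_gt(retrieval_gt):
    """Return True if retrieval_gt has the shape [[doc_id]].

    Assumes doc IDs are strings (an assumption of this sketch).
    """
    return (
        isinstance(retrieval_gt, list)
        and len(retrieval_gt) == 1               # one acceptable retrieval set
        and isinstance(retrieval_gt[0], list)
        and len(retrieval_gt[0]) == 1            # one passage in that set
        and isinstance(retrieval_gt[0][0], str)
    )


checks = {
    "nested": is_single_hop_gt([["doc-7"]]),      # expected single-hop shape
    "flat": is_single_hop_gt(["doc-7"]),          # missing the inner list
    "too_deep": is_single_hop_gt([[["doc-7"]]]),  # one level too many
}
```

Only the two-level form passes; a flat list or a three-level list is rejected.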
Sampling considerations:
| Consideration | Impact |
|---|---|
| Sample size (n) | Larger samples produce more QA pairs but increase LLM costs for query/answer generation |
| Random state | Fixed seed ensures reproducibility across runs |
| Single-hop vs. multi-hop | Single-hop is simpler and more reliable; multi-hop tests cross-passage reasoning |
| Document balance | Random sampling may favor longer documents; stratified sampling can address this |
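The stratified option in the last row could be approached as follows. This is only a sketch under stated assumptions: the function `stratified_single_hop` and its `per_doc` parameter are hypothetical, the corpus is a list of dicts, and the `path` field is used to identify the source document of each chunk.

```python
import random
import uuid
from collections import defaultdict


def stratified_single_hop(corpus, per_doc, random_state=42):
    """Sample up to `per_doc` passages from each source document."""
    rng = random.Random(random_state)
    by_path = defaultdict(list)
    for row in corpus:
        by_path[row["path"]].append(row)   # group chunks by source document

    qa_rows = []
    for path in sorted(by_path):           # fixed order keeps runs reproducible
        rows = by_path[path]
        k = min(per_doc, len(rows))        # short documents contribute what they have
        for row in rng.sample(rows, k):
            qa_rows.append(
                {"qid": str(uuid.uuid4()), "retrieval_gt": [[row["doc_id"]]]}
            )
    return qa_rows


# A long document with 8 chunks and a short one with 2 chunks.
corpus = [{"doc_id": f"long-{i}", "path": "long.pdf"} for i in range(8)]
corpus += [{"doc_id": f"short-{i}", "path": "short.pdf"} for i in range(2)]
qa_rows = stratified_single_hop(corpus, per_doc=2, random_state=0)
```

Here each document contributes two passages regardless of its length, whereas a uniform draw of four passages from the ten chunks would tend to favor the longer document.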