Implementation:Marker Inc Korea AutoRAG Random Single Hop

Knowledge Sources	AutoRAG
Domains	Information Retrieval, Evaluation Methodology, Data Sampling
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the AutoRAG framework, for randomly sampling single-hop passages from a corpus to serve as retrieval ground truth for QA generation.

Description

The random_single_hop function selects a uniformly random subset of n passages from the corpus DataFrame. For each sampled passage, it generates a unique question ID (UUID) and wraps the passage's doc_id in the nested list format expected by the retrieval ground truth schema: [[doc_id]]. The outer list represents alternative acceptable retrieval sets, and the inner list contains the passages in each set. For single-hop queries, there is exactly one set containing one passage.
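The behavior described above can be approximated with plain pandas. This is a minimal sketch for illustration, not the AutoRAG source: the function name sketch_random_single_hop and the toy corpus are assumptions made for this example.

```python
import uuid

import pandas as pd


def sketch_random_single_hop(
    corpus_df: pd.DataFrame, n: int, random_state: int = 42
) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration; not the AutoRAG source."""
    # Draw n rows uniformly at random (pandas samples without replacement by default).
    sample_df = corpus_df.sample(n=n, random_state=random_state)
    return pd.DataFrame(
        {
            # One fresh UUID string per sampled passage.
            "qid": [str(uuid.uuid4()) for _ in range(len(sample_df))],
            # Single-hop: each row holds one set that contains one doc_id.
            "retrieval_gt": [[[doc_id]] for doc_id in sample_df["doc_id"].tolist()],
        }
    )


# Toy corpus; a real corpus carries more columns, such as the passage text.
toy_corpus = pd.DataFrame({"doc_id": [f"doc-{i}" for i in range(10)]})
sampled = sketch_random_single_hop(toy_corpus, n=3)
print(sampled)
```

Each row of sampled pairs one new qid with one doc_id, so the number of rows equals n.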

The companion range_single_hop function provides deterministic sampling by selecting passages from a specific index range, useful for systematic coverage or reproducible experiments.
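A comparable sketch for the range-based variant is shown here. It assumes that the indices in idx_range are positional (row offsets) and uses DataFrame.iloc accordingly; that choice, the name sketch_range_single_hop, and the toy corpus are assumptions of this example, not confirmed details of the AutoRAG source.

```python
import uuid
from typing import Iterable

import pandas as pd


def sketch_range_single_hop(
    corpus_df: pd.DataFrame, idx_range: Iterable
) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration; not the AutoRAG source."""
    # Assumption: idx_range holds row positions, so iloc is used for selection.
    sample_df = corpus_df.iloc[list(idx_range)]
    return pd.DataFrame(
        {
            "qid": [str(uuid.uuid4()) for _ in range(len(sample_df))],
            # Single-hop: each row holds one set that contains one doc_id.
            "retrieval_gt": [[[doc_id]] for doc_id in sample_df["doc_id"].tolist()],
        }
    )


toy_corpus = pd.DataFrame({"doc_id": [f"doc-{i}" for i in range(10)]})
picked = sketch_range_single_hop(toy_corpus, range(0, 5))
print(picked["retrieval_gt"].tolist())
```

Because no randomness is involved in the selection, repeated calls with the same idx_range pick the same passages in the same order.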

Both functions return a DataFrame with qid and retrieval_gt columns, which is then wrapped in a QA instance by the Corpus.sample() method.

Usage

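Import either function and pass it as the sampling function argument to Corpus.sample(), supplying its parameters (n, random_state, or idx_range) as keyword arguments in the same call. This step follows corpus creation and precedes query generation.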

Code Reference

Source Location

Repository: AutoRAG
File: autorag/data/qa/sample.py (lines 7-26)

Signature

def random_single_hop(
    corpus_df: pd.DataFrame, n: int, random_state: int = 42
) -> pd.DataFrame:
    ...

def range_single_hop(corpus_df: pd.DataFrame, idx_range: Iterable) -> pd.DataFrame:
    ...

Import

from autorag.data.qa.sample import random_single_hop, range_single_hop

I/O Contract

Inputs

Name	Type	Required	Description
corpus_df	pd.DataFrame	yes	Chunked corpus DataFrame with at least a doc_id column
n	int	yes (random_single_hop)	Number of passages to sample from the corpus
random_state	int	no	Random seed for reproducibility. Defaults to 42.
idx_range	Iterable	yes (range_single_hop)	Index range or iterable of indices to select from the corpus (e.g., range(0, 100))

Outputs

Name	Type	Description
result	pd.DataFrame	DataFrame with columns: qid (str, UUID), retrieval_gt (List[List[str]], nested list of doc_ids). Wrapped in a QA instance by Corpus.sample().
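The output contract above can be checked mechanically. The helper below is a hypothetical validator written for this page (check_single_hop_output is not part of AutoRAG); it only encodes what the Outputs table states, assuming the single-hop shape of one set with one doc_id per row.

```python
import uuid

import pandas as pd


def check_single_hop_output(df: pd.DataFrame) -> None:
    """Hypothetical validator; raises ValueError if the contract is violated."""
    missing = {"qid", "retrieval_gt"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for qid, gt in zip(df["qid"], df["retrieval_gt"]):
        uuid.UUID(qid)  # raises ValueError when qid is not a valid UUID string
        # Single-hop: exactly one set holding exactly one doc_id string.
        if len(gt) != 1 or len(gt[0]) != 1 or not isinstance(gt[0][0], str):
            raise ValueError(f"unexpected retrieval_gt shape: {gt!r}")


# Hand-built frame in the documented shape, used only to exercise the check.
example = pd.DataFrame(
    {
        "qid": [str(uuid.uuid4()), str(uuid.uuid4())],
        "retrieval_gt": [[["doc-3"]], [["doc-7"]]],
    }
)
check_single_hop_output(example)  # passes silently
```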

Usage Examples

Basic Usage

from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop

# Assumes parsed_df (a DataFrame of parsed documents) already exists;
# chunk it to build the corpus
corpus = Raw(parsed_df).chunk("token", chunk_size=512)

# Sample 100 random passages for QA generation
qa = corpus.sample(random_single_hop, n=100)

# qa.data has columns: qid, retrieval_gt
print(qa.data.head())

Using range_single_hop

from autorag.data.qa.sample import range_single_hop

# Select passages at indices 0-49 deterministically
qa = corpus.sample(range_single_hop, idx_range=range(0, 50))

With Custom Random State

from autorag.data.qa.sample import random_single_hop

# Use a different seed for a different sample
qa = corpus.sample(random_single_hop, n=200, random_state=123)
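The effect of the seed can be seen with pandas alone. This sketch assumes that random_state is forwarded to pandas' DataFrame.sample, which is an assumption about the implementation and not something stated on this page; the toy corpus is likewise made up for the example.

```python
import pandas as pd

toy_corpus = pd.DataFrame({"doc_id": [f"doc-{i}" for i in range(100)]})

# Same seed twice, then a different seed.
first = toy_corpus.sample(n=5, random_state=123)["doc_id"].tolist()
second = toy_corpus.sample(n=5, random_state=123)["doc_id"].tolist()
other = toy_corpus.sample(n=5, random_state=7)["doc_id"].tolist()

print(first == second)  # the same seed selects the same passages
print(first)
print(other)
```

Note that a fixed seed pins down which passages are selected. If the qid values are freshly generated UUIDs, as the Description suggests, they would still differ between runs.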

Related Pages

Implements Principle

Principle:Marker_Inc_Korea_AutoRAG_Corpus_Sampling
