Implementation:Marker Inc Korea AutoRAG Random Single Hop
| Knowledge Sources | |
|---|---|
| Domains | Information Retrieval, Evaluation Methodology, Data Sampling |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Concrete tool, provided by the AutoRAG framework, for randomly sampling single passages from a corpus to serve as single-hop retrieval ground truth for QA generation.
Description
The random_single_hop function selects a uniformly random subset of n passages from the corpus DataFrame. For each sampled passage, it generates a unique question ID (UUID) and wraps the passage's doc_id in the nested list format expected by the retrieval ground truth schema: [[doc_id]]. The outer list represents alternative acceptable retrieval sets, and each inner list contains the passages in one set. For single-hop queries, there is exactly one set containing one passage.
The companion range_single_hop function provides deterministic sampling by selecting passages from a specific index range, useful for systematic coverage or reproducible experiments.
Both functions return a DataFrame with qid and retrieval_gt columns, which is then wrapped in a QA instance by the Corpus.sample() method.
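The behavior described above can be approximated with a few lines of pandas. This is a minimal sketch, not the library's source: the names sketch_random_single_hop, sketch_range_single_hop, and _to_qa_frame are hypothetical, and the use of uuid4 and of positional (iloc) selection for the range variant are assumptions.

```python
import uuid
from typing import Iterable

import pandas as pd


def _to_qa_frame(sample_df: pd.DataFrame) -> pd.DataFrame:
    """Build the qid / retrieval_gt frame for single-hop samples."""
    doc_ids = sample_df["doc_id"].tolist()
    return pd.DataFrame(
        {
            # One fresh UUID per sampled passage (uuid4 is an assumption).
            "qid": [str(uuid.uuid4()) for _ in doc_ids],
            # Single-hop: one set holding one passage, i.e. [[doc_id]].
            "retrieval_gt": [[[doc_id]] for doc_id in doc_ids],
        }
    )


def sketch_random_single_hop(
    corpus_df: pd.DataFrame, n: int, random_state: int = 42
) -> pd.DataFrame:
    # DataFrame.sample draws n rows; random_state makes the draw repeatable.
    return _to_qa_frame(corpus_df.sample(n=n, random_state=random_state))


def sketch_range_single_hop(
    corpus_df: pd.DataFrame, idx_range: Iterable
) -> pd.DataFrame:
    # Positional selection is an assumption of this sketch.
    return _to_qa_frame(corpus_df.iloc[list(idx_range)])


# Toy corpus for illustration only.
toy_corpus = pd.DataFrame(
    {
        "doc_id": ["d0", "d1", "d2", "d3", "d4"],
        "contents": ["a", "b", "c", "d", "e"],
    }
)

random_qa = sketch_random_single_hop(toy_corpus, n=3)
range_qa = sketch_range_single_hop(toy_corpus, range(0, 2))
print(random_qa)
print(range_qa)
```

In this sketch a fixed random_state pins down which passages are selected, while the qid values are regenerated on every call, so two runs with the same seed share retrieval_gt but not qid.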
Usage
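Import either function and pass it as the sampling-function argument to Corpus.sample(), supplying the function's own parameters (n, random_state, or idx_range) as keyword arguments, as in the usage examples. This step follows corpus creation and precedes query generation.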
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/data/qa/sample.py (lines 7-26)
Signature
def random_single_hop(
    corpus_df: pd.DataFrame, n: int, random_state: int = 42
) -> pd.DataFrame:
    ...

def range_single_hop(corpus_df: pd.DataFrame, idx_range: Iterable) -> pd.DataFrame:
    ...
Import
from autorag.data.qa.sample import random_single_hop, range_single_hop
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| corpus_df | pd.DataFrame | yes | Chunked corpus DataFrame with at least a doc_id column |
| n | int | yes (random_single_hop) | Number of passages to sample from the corpus |
| random_state | int | no | Random seed for reproducibility. Defaults to 42. |
| idx_range | Iterable | yes (range_single_hop) | Index range or iterable of indices to select from the corpus (e.g., range(0, 100)) |
Outputs
| Name | Type | Description |
|---|---|---|
| result | pd.DataFrame | DataFrame with columns: qid (str, UUID), retrieval_gt (List[List[str]], nested list of doc_ids). Wrapped in a QA instance by Corpus.sample(). |
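The output contract can be checked mechanically before the sampled rows are passed on to query generation. The following is a minimal sketch using only the standard library; check_single_hop_rows is a hypothetical helper, not part of AutoRAG, and it assumes the rows are available as plain dictionaries (for example via qa.data.to_dict("records")).

```python
import uuid
from typing import Any, Iterable, Mapping


def check_single_hop_rows(rows: Iterable[Mapping[str, Any]]) -> int:
    """Validate rows against the single-hop contract; return the row count."""
    seen_qids = set()
    count = 0
    for row in rows:
        qid, gt = row["qid"], row["retrieval_gt"]
        uuid.UUID(qid)  # raises ValueError if qid is not a UUID string
        if qid in seen_qids:
            raise ValueError(f"duplicate qid: {qid}")
        seen_qids.add(qid)
        # Expect exactly one set holding exactly one doc_id string.
        ok = (
            isinstance(gt, list)
            and len(gt) == 1
            and isinstance(gt[0], list)
            and len(gt[0]) == 1
            and isinstance(gt[0][0], str)
        )
        if not ok:
            raise ValueError(f"not single-hop retrieval_gt: {gt!r}")
        count += 1
    return count


# Hand-written rows for illustration only.
rows = [
    {"qid": str(uuid.uuid4()), "retrieval_gt": [["doc-1"]]},
    {"qid": str(uuid.uuid4()), "retrieval_gt": [["doc-2"]]},
]
checked = check_single_hop_rows(rows)
print(f"{checked} rows satisfy the single-hop contract")
```

The isinstance checks assume Python lists. If the DataFrame has been round-tripped through a file format that stores nested lists as arrays, the check would need to be relaxed accordingly.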
Usage Examples
Basic Usage
from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop
# Assume parsed_df is an already-parsed document DataFrame; chunk it into a corpus
corpus = Raw(parsed_df).chunk("llama_index_chunk", chunk_method="token", chunk_size=512)
# Sample 100 random passages for QA generation
qa = corpus.sample(random_single_hop, n=100)
# qa.data has columns: qid, retrieval_gt
print(qa.data.head())
Using range_single_hop
from autorag.data.qa.sample import range_single_hop
# Select passages at indices 0-49 deterministically
qa = corpus.sample(range_single_hop, idx_range=range(0, 50))
With Custom Random State
from autorag.data.qa.sample import random_single_hop
# Use a different seed for a different sample
qa = corpus.sample(random_single_hop, n=200, random_state=123)