Implementation:Marker Inc Korea AutoRAG Random Single Hop

From Leeroopedia
Knowledge Sources
Domains: Information Retrieval, Evaluation Methodology, Data Sampling
Last Updated: 2026-02-12 00:00 GMT

Overview

A concrete tool, provided by the AutoRAG framework, for randomly sampling single-hop passages from a corpus to serve as retrieval ground truth for QA generation.

Description

The random_single_hop function selects a uniformly random subset of n passages from the corpus DataFrame. For each sampled passage, it generates a unique question ID (a UUID string) and wraps the passage's doc_id in the nested list format expected by the retrieval ground truth schema: [[doc_id]]. The outer list holds the alternative acceptable retrieval sets, and each inner list holds the passages in one set. For single-hop queries, there is exactly one set containing exactly one passage.
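The behavior described above can be sketched in a few lines of pandas. This is a minimal illustration, not the AutoRAG source: the name sketch_random_single_hop and the toy corpus are hypothetical, and the only assumption about the input is that it has a doc_id column.

```python
import uuid

import pandas as pd


def sketch_random_single_hop(
    corpus_df: pd.DataFrame, n: int, random_state: int = 42
) -> pd.DataFrame:
    """Illustrative stand-in for the sampling step (hypothetical name)."""
    # Draw n distinct rows; pandas samples without replacement by default.
    sample_df = corpus_df.sample(n=n, random_state=random_state)
    return pd.DataFrame(
        {
            # One fresh UUID string per sampled passage.
            "qid": [str(uuid.uuid4()) for _ in range(len(sample_df))],
            # One acceptable retrieval set that holds one passage.
            "retrieval_gt": [[[doc_id]] for doc_id in sample_df["doc_id"].tolist()],
        }
    )


# Toy corpus (made-up data) with the minimum required column.
toy_corpus = pd.DataFrame(
    {
        "doc_id": [f"doc-{i}" for i in range(10)],
        "contents": [f"passage {i}" for i in range(10)],
    }
)
sampled = sketch_random_single_hop(toy_corpus, n=3)
print(sampled)
```

In this sketch, repeating the call with the same random_state selects the same passages, while the qid values differ between calls because each one is a freshly generated UUID.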

The companion range_single_hop function provides deterministic sampling by selecting passages from a specific index range, useful for systematic coverage or reproducible experiments.
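A range-based selection of this kind might look like the sketch below. It is an illustration rather than the AutoRAG implementation: the name sketch_range_single_hop is hypothetical, and it assumes that idx_range holds row positions, so positional indexing with iloc is used.

```python
import uuid
from typing import Iterable

import pandas as pd


def sketch_range_single_hop(corpus_df: pd.DataFrame, idx_range: Iterable) -> pd.DataFrame:
    """Illustrative stand-in for range-based selection (hypothetical name)."""
    # Assumption: idx_range contains row positions, not index labels.
    sample_df = corpus_df.iloc[list(idx_range)]
    return pd.DataFrame(
        {
            "qid": [str(uuid.uuid4()) for _ in range(len(sample_df))],
            "retrieval_gt": [[[doc_id]] for doc_id in sample_df["doc_id"].tolist()],
        }
    )


# Toy corpus (made-up data).
toy_corpus = pd.DataFrame({"doc_id": [f"doc-{i}" for i in range(10)]})
picked = sketch_range_single_hop(toy_corpus, idx_range=range(2, 5))
print(picked["retrieval_gt"].tolist())
```

Because the rows are chosen by position, the selected passages depend only on the corpus order and the range, which is what makes this variant suitable for reproducible or exhaustive coverage.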

Both functions return a DataFrame with qid and retrieval_gt columns, which is then wrapped in a QA instance by the Corpus.sample() method.

Usage

Import and use these functions as the sampling function argument to Corpus.sample(). This step follows corpus creation and precedes query generation.

Code Reference

Source Location

  • Repository: AutoRAG
  • File: autorag/data/qa/sample.py (lines 7-26)

Signature

def random_single_hop(
    corpus_df: pd.DataFrame, n: int, random_state: int = 42
) -> pd.DataFrame:
    ...

def range_single_hop(corpus_df: pd.DataFrame, idx_range: Iterable) -> pd.DataFrame:
    ...

Import

from autorag.data.qa.sample import random_single_hop, range_single_hop

I/O Contract

Inputs

Name | Type | Required | Description
corpus_df | pd.DataFrame | yes | Chunked corpus DataFrame with at least a doc_id column
n | int | yes (random_single_hop) | Number of passages to sample from the corpus
random_state | int | no | Random seed for reproducibility. Defaults to 42.
idx_range | Iterable | yes (range_single_hop) | Index range or iterable of indices to select from the corpus (e.g., range(0, 100))

Outputs

Name | Type | Description
result | pd.DataFrame | DataFrame with columns: qid (str, UUID), retrieval_gt (List[List[str]], nested list of doc_ids). Wrapped in a QA instance by Corpus.sample().
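The output contract above can be checked mechanically. The helper below is a hypothetical sketch, not part of AutoRAG: check_single_hop_output is an invented name, and it assumes the single-hop shape of one retrieval set containing one doc_id string per row.

```python
import uuid

import pandas as pd


def check_single_hop_output(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df breaks the documented contract."""
    missing = {"qid", "retrieval_gt"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for qid, gt in zip(df["qid"], df["retrieval_gt"]):
        if not isinstance(qid, str):
            raise ValueError(f"qid is not a string: {qid!r}")
        uuid.UUID(qid)  # raises ValueError when the string is not a valid UUID
        single_set = isinstance(gt, list) and len(gt) == 1 and isinstance(gt[0], list)
        if not (single_set and len(gt[0]) == 1 and isinstance(gt[0][0], str)):
            raise ValueError(f"retrieval_gt is not of the form [[doc_id]]: {gt!r}")


# A well-formed row passes silently.
good = pd.DataFrame({"qid": [str(uuid.uuid4())], "retrieval_gt": [[["doc-0"]]]})
check_single_hop_output(good)

# A row with one nesting level too few is rejected.
bad = pd.DataFrame({"qid": [str(uuid.uuid4())], "retrieval_gt": [["doc-0"]]})
try:
    check_single_hop_output(bad)
    bad_rejected = False
except ValueError:
    bad_rejected = True
print(bad_rejected)
```

A check like this could be run on the sampler's DataFrame before query generation, so that a malformed retrieval_gt is caught early instead of surfacing later in evaluation.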

Usage Examples

Basic Usage

from autorag.data.qa.schema import Raw
from autorag.data.qa.sample import random_single_hop

# Build the corpus by chunking a parsed DataFrame (parsed_df is assumed to exist already)
corpus = Raw(parsed_df).chunk("token", chunk_size=512)

# Sample 100 random passages for QA generation
qa = corpus.sample(random_single_hop, n=100)

# qa.data has columns: qid, retrieval_gt
print(qa.data.head())

Using range_single_hop

from autorag.data.qa.sample import range_single_hop

# Select passages at indices 0-49 deterministically
qa = corpus.sample(range_single_hop, idx_range=range(0, 50))

With Custom Random State

from autorag.data.qa.sample import random_single_hop

# Use a different seed for a different sample
qa = corpus.sample(random_single_hop, n=200, random_state=123)

Related Pages

Implements Principle