Principle:Marker Inc Korea AutoRAG Schema Construction

From Leeroopedia
Knowledge Sources
Domains: Software Engineering, Data Pipeline Design, Natural Language Processing
Last Updated: 2026-02-12 00:00 GMT

Overview

Schema construction creates typed, chainable data containers that define the data flow through the RAG evaluation data pipeline, using the fluent API pattern to enable readable method chaining from raw documents to finished QA datasets.

Description

In a RAG evaluation data creation workflow, data progresses through well-defined stages: raw parsed text, chunked corpus passages, and question-answer pairs. Schema construction formalizes each stage as a distinct class with typed fields, transformation methods, and explicit links to predecessor stages. This approach enforces data integrity at each transition and makes the pipeline self-documenting.

The fluent API pattern (also called method chaining) is a software design technique where each method on an object returns an object (often of a new type), enabling a sequence of operations to be expressed as a single chained expression. In the context of evaluation data creation, the chain follows the progression Raw -> Corpus -> QA. A Raw instance wraps parsed document data and exposes a chunk() method that produces a Corpus. A Corpus wraps chunked passages and exposes a sample() method that produces a QA instance. The QA instance then supports methods for query generation, answer generation, filtering, and export.
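The chain described above can be sketched with three small classes. This is a toy illustration, not the AutoRAG implementation: the names ToyRaw, ToyCorpus, and ToyQA, the character-window chunker, and the use of plain lists of dicts instead of DataFrames are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyQA:
    """Toy stand-in for the QA stage (hypothetical, not the library class)."""
    data: list
    linked_corpus: "ToyCorpus"


@dataclass
class ToyCorpus:
    """Toy stand-in for the Corpus stage; keeps a back-link to its Raw."""
    data: list
    linked_raw: "ToyRaw"

    def sample(self, picker: Callable[[list], list]) -> ToyQA:
        # Each sampled passage becomes one QA row that points at its doc_id.
        rows = [
            {"qid": f"q{i}", "retrieval_gt": [[p["doc_id"]]]}
            for i, p in enumerate(picker(self.data))
        ]
        return ToyQA(rows, linked_corpus=self)


@dataclass
class ToyRaw:
    """Toy stand-in for the Raw stage: a list of parsed documents."""
    data: list

    def chunk(self, size: int) -> ToyCorpus:
        # Fixed-size character windows; a simplification of real chunkers.
        passages = []
        for doc in self.data:
            text = doc["contents"]
            for start in range(0, len(text), size):
                passages.append({
                    "doc_id": f"{doc['raw_id']}-{start}",
                    "contents": text[start:start + size],
                })
        return ToyCorpus(passages, linked_raw=self)


raw = ToyRaw([{"raw_id": "r0", "contents": "abcdefghij"}])
# Each call returns the next stage, so the whole pipeline is one expression.
qa = raw.chunk(size=4).sample(lambda passages: passages[:2])
```

Because each method returns the next stage's type, the expression reads left to right in the same order as the data flow, and the final object still reaches the original input through qa.linked_corpus.linked_raw.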

The key advantage of this pattern is that it encodes the valid pipeline transitions directly in the class interfaces. Although batch_apply() is available on every stage, a developer cannot accidentally generate queries before chunking: query generators operate on QA rows, and within the chain a QA instance is produced only by Corpus.sample(), which in turn requires a Corpus produced by Raw.chunk(). Each stage also maintains a back-link to its predecessor (Corpus links to its source Raw, QA links to its source Corpus), enabling operations such as corpus remapping that need to traverse the full provenance chain.
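A minimal sketch of how stage-specific methods block an out-of-order step is shown here. The classes are hypothetical stand-ins (ToyRaw, ToyCorpus, ToyQA), and generate_queries is an invented method name used only for illustration. Since Python checks attribute access at runtime, the invalid call fails with AttributeError; a static type checker could flag the same line earlier.

```python
class ToyQA:
    def __init__(self, rows, linked_corpus):
        self.rows = rows
        self.linked_corpus = linked_corpus

    def generate_queries(self):
        # Hypothetical query generator: only meaningful on QA rows.
        rows = [{**r, "query": f"What does {r['doc_id']} say?"} for r in self.rows]
        return ToyQA(rows, self.linked_corpus)


class ToyCorpus:
    def __init__(self, rows, linked_raw):
        self.rows = rows
        self.linked_raw = linked_raw

    def sample(self, n):
        # The only way this sketch produces a ToyQA.
        return ToyQA(self.rows[:n], linked_corpus=self)


class ToyRaw:
    def __init__(self, rows):
        self.rows = rows

    def chunk(self):
        # One passage per document, to keep the sketch short.
        rows = [{"doc_id": f"c{i}", "contents": r} for i, r in enumerate(self.rows)]
        return ToyCorpus(rows, linked_raw=self)


raw = ToyRaw(["first document", "second document"])

# Invalid order: ToyRaw has no sample(), so chunking cannot be skipped.
try:
    raw.sample(1)
    skipped_chunking = True
except AttributeError:
    skipped_chunking = False

# Valid order: Raw -> Corpus -> QA -> queries.
qa = raw.chunk().sample(1).generate_queries()
```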

Usage

Schema construction is used whenever the evaluation data creation workflow is driven programmatically via the Python API (as opposed to the YAML-based CLI). The fluent API is the primary interface for composing custom data generation pipelines, allowing researchers to mix and match chunking strategies, sampling methods, query generators, answer generators, and quality filters in a single expression chain.

Theoretical Basis

The pipeline type graph can be formalized as:

Raw --(chunk)--> Corpus --(sample)--> QA

Where:
    Raw.data     : DataFrame[raw_id, contents]
    Corpus.data  : DataFrame[doc_id, contents, path, start_end_idx, metadata]
    QA.data      : DataFrame[qid, query, retrieval_gt, generation_gt]

Back-links:
    Corpus._linked_raw  -> Raw
    QA._linked_corpus   -> Corpus
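
The column lists above lend themselves to a simple integrity check at each transition. The sketch below is an assumption about how such a check could look, not a description of the library's own validation; REQUIRED_COLUMNS and missing_columns are hypothetical names.

```python
# Required columns per stage, copied from the type graph above.
REQUIRED_COLUMNS = {
    "Raw": ["raw_id", "contents"],
    "Corpus": ["doc_id", "contents", "path", "start_end_idx", "metadata"],
    "QA": ["qid", "query", "retrieval_gt", "generation_gt"],
}


def missing_columns(stage: str, columns: list) -> list:
    """Return the required columns of `stage` that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS[stage] if name not in present]


# Assumed example: a QA table right after sampling, before query and
# answer generation have filled in the remaining fields.
freshly_sampled = ["qid", "retrieval_gt"]
gaps = missing_columns("QA", freshly_sampled)
```

In this example gaps lists query and generation_gt, the fields that later pipeline steps are expected to add.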

Transformation methods at each stage follow a consistent functional pattern:

Method                   Input    Output   Description
Raw.batch_apply(fn)      Raw      Raw      Apply async function to each row, return new Raw
Raw.map(fn)              Raw      Raw      Apply sync function to DataFrame, return new Raw
Raw.chunk(module)        Raw      Corpus   Chunk the raw text using named module, return Corpus linked to this Raw
Corpus.batch_apply(fn)   Corpus   Corpus   Apply async function to each row, return new Corpus
Corpus.sample(fn)        Corpus   QA       Sample passages for QA generation, return QA linked to this Corpus
QA.batch_apply(fn)       QA       QA       Apply async function (query gen, answer gen) to each row
QA.filter(fn)            QA       QA       Remove rows where fn returns False
QA.batch_filter(fn)      QA       QA       Async filter, remove rows where fn returns False
QA.to_parquet(paths)     QA       Files    Serialize to parquet
QA.update_corpus(new)    QA       QA       Remap retrieval_gt to new corpus
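
The batch_apply and filter rows of the table share one idea: apply a function per row and return a new instance rather than mutating the old one. A minimal sketch of that pattern, assuming rows are plain dicts and using asyncio.gather for concurrency, might look as follows. ToyQA, fake_query_gen, and has_long_query are hypothetical names, and the real library may schedule its coroutines differently.

```python
import asyncio


class ToyQA:
    """Toy QA container holding a list of row dicts (not the library class)."""

    def __init__(self, rows):
        self.rows = rows

    def batch_apply(self, fn, **kwargs):
        # Run the async function over all rows concurrently.
        # asyncio.gather returns results in the order of its arguments.
        async def run_all():
            return await asyncio.gather(*(fn(row, **kwargs) for row in self.rows))

        return ToyQA(list(asyncio.run(run_all())))

    def filter(self, fn, **kwargs):
        # Keep only rows for which the sync predicate returns True.
        return ToyQA([row for row in self.rows if fn(row, **kwargs)])


async def fake_query_gen(row, prefix):
    await asyncio.sleep(0)  # stands in for an LLM call
    return {**row, "query": f"{prefix} {row['contents']}?"}


def has_long_query(row, min_len):
    return len(row["query"]) >= min_len


qa = ToyQA([{"contents": "cats"}, {"contents": "photosynthesis"}])
result = (qa
          .batch_apply(fake_query_gen, prefix="What is")
          .filter(has_long_query, min_len=20))
```

Returning a new container at every step leaves qa untouched, which is what allows intermediate stages to be inspected or reused while experimenting with different generators and filters.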

Provenance chain: The back-links between stages enable full traceability. Given any QA pair, the system can trace back to the specific corpus chunk (via retrieval_gt and linked_corpus), and from there to the raw parsed text (via linked_raw). This chain is essential for corpus remapping and for debugging data quality issues.
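
The traversal described above can be sketched as a lookup from a QA row to its corpus chunk and then to the raw text. The data layout here is an assumption made for illustration: retrieval_gt is taken to be a list of lists of doc_id values, each corpus row carries a hypothetical raw_id field, and start_end_idx is treated as a half-open (start, end) character range.

```python
def trace(qa_row, corpus_rows, raw_rows):
    """Follow one QA row back to (doc_id, raw_id, source text span) tuples."""
    corpus_by_id = {c["doc_id"]: c for c in corpus_rows}
    raw_by_id = {r["raw_id"]: r for r in raw_rows}
    traces = []
    for group in qa_row["retrieval_gt"]:
        for doc_id in group:
            chunk = corpus_by_id[doc_id]            # QA -> Corpus
            start, end = chunk["start_end_idx"]
            source = raw_by_id[chunk["raw_id"]]     # Corpus -> Raw
            traces.append((doc_id, chunk["raw_id"], source["contents"][start:end]))
    return traces


raw_rows = [{"raw_id": "r0", "contents": "RAG pairs retrieval with generation."}]
corpus_rows = [
    {"doc_id": "c0", "raw_id": "r0", "start_end_idx": (0, 19)},
    {"doc_id": "c1", "raw_id": "r0", "start_end_idx": (20, 36)},
]
qa_row = {"qid": "q0", "retrieval_gt": [["c1"]]}

provenance = trace(qa_row, corpus_rows, raw_rows)
```

The same lookup structure is what a remapping step would need: once the original span in the raw text is known, it can be matched against the chunks of a differently chunked corpus.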

A typical fluent pipeline expression looks like:

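# Assumes parsed_df (parsed documents) and llm (an LLM client) are defined,
# and that the sampler, generators, and filter have been imported.
qa = (Raw(parsed_df)                                       # wrap parsed text
      .chunk("token", chunk_size=512, chunk_overlap=64)    # Raw -> Corpus
      .sample(random_single_hop, n=100)                    # Corpus -> QA
      .make_retrieval_gt_contents()                        # attach ground-truth passage text
      .batch_apply(factoid_query_gen, llm=llm)             # generate queries
      .batch_apply(make_basic_gen_gt, llm=llm)             # generate answers
      .filter(dontknow_filter_rule_based, lang="en"))      # drop "don't know" answers
# Write the QA pairs and the linked corpus to separate parquet files.
qa.to_parquet("qa.parquet", "corpus.parquet")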
