Principle:Marker Inc Korea AutoRAG Schema Construction
| Knowledge Sources | |
|---|---|
| Domains | Software Engineering, Data Pipeline Design, Natural Language Processing |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Schema construction creates typed, chainable data containers that define the data flow through the RAG evaluation data pipeline, using the fluent API pattern to enable readable method chaining from raw documents to finished QA datasets.
Description
In a RAG evaluation data creation workflow, data progresses through well-defined stages: raw parsed text, chunked corpus passages, and question-answer pairs. Schema construction formalizes each stage as a distinct class with typed fields, transformation methods, and explicit links to predecessor stages. This approach enforces data integrity at each transition and makes the pipeline self-documenting.
The fluent API pattern (also called method chaining) is a software design technique where each method on an object returns an object (often of a new type), enabling a sequence of operations to be expressed as a single chained expression. In the context of evaluation data creation, the chain follows the progression Raw -> Corpus -> QA. A Raw instance wraps parsed document data and exposes a chunk() method that produces a Corpus. A Corpus wraps chunked passages and exposes a sample() method that produces a QA instance. The QA instance then supports methods for query generation, answer generation, filtering, and export.
The key advantage of this pattern is that it encodes the valid pipeline transitions directly in the type system. A developer cannot accidentally generate queries before chunking and sampling: although every stage exposes batch_apply(), query generation runs through batch_apply() on the QA class, and a QA instance can only be obtained by calling sample() on a Corpus, which in turn comes from Raw.chunk(). Similarly, each stage maintains a back-link to its predecessor (Corpus links to its source Raw, QA links to its source Corpus), enabling operations like corpus remapping that need to traverse the full provenance chain.
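The chaining and back-link ideas described above can be sketched in a few lines of plain Python. This is a minimal illustration, not AutoRAG's implementation: the class names MiniRaw, MiniCorpus, and MiniQA, the add_query() method, and the fixed-width character chunking are all hypothetical, and lists of dicts stand in for DataFrames to keep the sketch dependency-free.

```python
from typing import Callable, List, Optional


class MiniRaw:
    """Stage 1 (hypothetical): parsed documents, one dict per document."""

    def __init__(self, data: List[dict]):
        self.data = data

    def chunk(self, size: int) -> "MiniCorpus":
        # Fixed-width character chunking, chosen only to keep the sketch short.
        rows = []
        for doc in self.data:
            text = doc["contents"]
            for start in range(0, len(text), size):
                rows.append({
                    "doc_id": f"{doc['raw_id']}:{start}",
                    "contents": text[start:start + size],
                })
        return MiniCorpus(rows, linked_raw=self)


class MiniCorpus:
    """Stage 2 (hypothetical): passages plus a back-link to their Raw."""

    def __init__(self, data: List[dict], linked_raw: Optional[MiniRaw] = None):
        self.data = data
        self.linked_raw = linked_raw

    def sample(self, n: int) -> "MiniQA":
        # Take the first n passages; a real sampler would likely randomize.
        rows = [
            {"qid": f"q{i}", "retrieval_gt": [[passage["doc_id"]]]}
            for i, passage in enumerate(self.data[:n])
        ]
        return MiniQA(rows, linked_corpus=self)


class MiniQA:
    """Stage 3 (hypothetical): QA rows plus a back-link to their Corpus."""

    def __init__(self, data: List[dict], linked_corpus: Optional[MiniCorpus] = None):
        self.data = data
        self.linked_corpus = linked_corpus

    def add_query(self, make_query: Callable[[str], str]) -> "MiniQA":
        # Resolve each row's ground-truth passage through the back-link.
        contents = {p["doc_id"]: p["contents"] for p in self.linked_corpus.data}
        rows = [
            {**row, "query": make_query(contents[row["retrieval_gt"][0][0]])}
            for row in self.data
        ]
        return MiniQA(rows, linked_corpus=self.linked_corpus)


raw = MiniRaw([{"raw_id": "r0", "contents": "abcdefghij"}])
qa = raw.chunk(size=4).sample(n=2).add_query(lambda text: f"What is {text}?")
```

Because add_query() exists only on MiniQA, a call such as raw.add_query(...) fails with an AttributeError, and a static type checker would flag it before the code runs. The back-links let qa.linked_corpus.linked_raw recover the original MiniRaw.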
Usage
Schema construction is used whenever the evaluation data creation workflow is driven programmatically via the Python API (as opposed to the YAML-based CLI). The fluent API is the primary interface for composing custom data generation pipelines, allowing researchers to mix and match chunking strategies, sampling methods, query generators, answer generators, and quality filters in a single expression chain.
Theoretical Basis
The pipeline type graph can be formalized as:
Raw --(chunk)--> Corpus --(sample)--> QA
Where:
Raw.data : DataFrame[raw_id, contents]
Corpus.data : DataFrame[doc_id, contents, path, start_end_idx, metadata]
QA.data : DataFrame[qid, query, retrieval_gt, generation_gt]
Back-links:
Corpus._linked_raw -> Raw
QA._linked_corpus -> Corpus
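One way to make the column contracts above checkable is to store them as data and compare them against the columns of a finished stage. The sketch below is an assumption about how such a check could look, not a description of AutoRAG's own validation; REQUIRED_COLUMNS and missing_columns() are hypothetical names, and the column lists are copied from the definitions above.

```python
from typing import Iterable, List

# Column contract per stage, copied from the type graph above.
REQUIRED_COLUMNS = {
    "Raw": ("raw_id", "contents"),
    "Corpus": ("doc_id", "contents", "path", "start_end_idx", "metadata"),
    "QA": ("qid", "query", "retrieval_gt", "generation_gt"),
}


def missing_columns(stage: str, columns: Iterable[str]) -> List[str]:
    """Return the required columns of `stage` that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS[stage] if name not in present]


# A QA table that has been sampled and given queries, but no answers yet.
gaps = missing_columns("QA", ["qid", "query", "retrieval_gt"])
```

With a pandas DataFrame, the second argument would be df.columns. Since QA columns are filled in progressively along the chain, a check like this fits best just before export.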
Transformation methods at each stage follow a consistent functional pattern:
| Method | Input | Output | Description |
|---|---|---|---|
| Raw.batch_apply(fn) | Raw | Raw | Apply async function to each row, return new Raw |
| Raw.map(fn) | Raw | Raw | Apply sync function to DataFrame, return new Raw |
| Raw.chunk(module) | Raw | Corpus | Chunk the raw text using named module, return Corpus linked to this Raw |
| Corpus.batch_apply(fn) | Corpus | Corpus | Apply async function to each row, return new Corpus |
| Corpus.sample(fn) | Corpus | QA | Sample passages for QA generation, return QA linked to this Corpus |
| QA.batch_apply(fn) | QA | QA | Apply async function (e.g. query generation, answer generation) to each row, return new QA |
| QA.filter(fn) | QA | QA | Remove rows where fn returns False |
| QA.batch_filter(fn) | QA | QA | Async filter, remove rows where fn returns False |
| QA.to_parquet(paths) | QA | Files | Write the QA data and its linked corpus to parquet files |
| QA.update_corpus(new) | QA | QA | Remap retrieval_gt to new corpus |
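The batch_apply, filter, and batch_filter rows in the table share one pattern: take the rows, run a function over each, and return new rows without mutating the input. The following sketch shows that pattern with asyncio on lists of dicts. It is a sketch under stated assumptions: the free functions, the batch_size parameter, and fake_query_gen are hypothetical stand-ins, not AutoRAG signatures.

```python
import asyncio
from typing import Awaitable, Callable, List


def batch_apply(rows: List[dict], fn: Callable[..., Awaitable[dict]],
                batch_size: int = 2, **kwargs) -> List[dict]:
    """Run async `fn` on a copy of every row, `batch_size` rows at a time."""
    async def run() -> List[dict]:
        out: List[dict] = []
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            # gather() preserves input order, so rows stay aligned.
            out.extend(await asyncio.gather(*(fn(dict(r), **kwargs) for r in batch)))
        return out
    return asyncio.run(run())


def filter_rows(rows: List[dict], fn: Callable[..., bool], **kwargs) -> List[dict]:
    """Keep only the rows for which sync `fn` returns True."""
    return [r for r in rows if fn(r, **kwargs)]


def batch_filter(rows: List[dict], fn: Callable[..., Awaitable[bool]],
                 batch_size: int = 2, **kwargs) -> List[dict]:
    """Keep only the rows for which async `fn` returns True."""
    async def run() -> List[bool]:
        flags: List[bool] = []
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            flags.extend(await asyncio.gather(*(fn(r, **kwargs) for r in batch)))
        return flags
    return [r for r, keep in zip(rows, asyncio.run(run())) if keep]


async def fake_query_gen(row: dict, prefix: str = "Q") -> dict:
    # Stand-in for an LLM call; a real generator would await a network request.
    await asyncio.sleep(0)
    row["query"] = f"{prefix}: {row['qid']}"
    return row


async def has_short_qid(row: dict, max_len: int = 1) -> bool:
    await asyncio.sleep(0)
    return len(row["qid"]) <= max_len


rows = [{"qid": "a"}, {"qid": "bb"}, {"qid": "c"}]
with_queries = batch_apply(rows, fake_query_gen, prefix="Q")
kept_sync = filter_rows(with_queries, lambda r: r["qid"] != "a")
kept_async = batch_filter(with_queries, has_short_qid, max_len=1)
```

Extra keyword arguments are forwarded to the function, which mirrors how a call such as batch_apply(factoid_query_gen, llm=llm) passes llm through.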
Provenance chain: The back-links between stages enable full traceability. Given any QA pair, the system can trace back to the specific corpus chunk (via retrieval_gt and linked_corpus), and from there to the raw parsed text (via linked_raw). This chain is essential for corpus remapping and for debugging data quality issues.
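A row-level trace along this chain might look like the sketch below. Several details are assumptions made for illustration: corpus rows carry a hypothetical raw_id key to join back to raw rows, start_end_idx is treated as a half-open (start, end) character range, and retrieval_gt is a list of lists of doc_id values.

```python
from typing import List, Tuple


def trace_provenance(qa_row: dict, corpus_rows: List[dict],
                     raw_rows: List[dict]) -> List[Tuple[str, str, str]]:
    """For each ground-truth doc_id, return (doc_id, raw_id, source text span)."""
    corpus_by_id = {c["doc_id"]: c for c in corpus_rows}
    raw_by_id = {r["raw_id"]: r for r in raw_rows}
    hops = []
    for group in qa_row["retrieval_gt"]:
        for doc_id in group:
            chunk = corpus_by_id[doc_id]              # QA -> Corpus
            raw = raw_by_id[chunk["raw_id"]]          # Corpus -> Raw (assumed join key)
            start, end = chunk["start_end_idx"]       # assumed half-open range
            hops.append((doc_id, raw["raw_id"], raw["contents"][start:end]))
    return hops


raw_rows = [{"raw_id": "r0", "contents": "hello world"}]
corpus_rows = [
    {"doc_id": "c0", "raw_id": "r0", "start_end_idx": (0, 5)},
    {"doc_id": "c1", "raw_id": "r0", "start_end_idx": (6, 11)},
]
qa_row = {"qid": "q0", "retrieval_gt": [["c1"]]}
hops = trace_provenance(qa_row, corpus_rows, raw_rows)
```

The same lookups are what a corpus remapping step would need: once the raw span behind a ground-truth chunk is known, it can be matched against the chunks of a differently chunked corpus.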
A typical fluent pipeline expression looks like:
qa = (Raw(parsed_df)
    .chunk("token", chunk_size=512, chunk_overlap=64)
    .sample(random_single_hop, n=100)
    .make_retrieval_gt_contents()
    .batch_apply(factoid_query_gen, llm=llm)
    .batch_apply(make_basic_gen_gt, llm=llm)
    .filter(dontknow_filter_rule_based, lang="en"))
qa.to_parquet("qa.parquet", "corpus.parquet")