Principle: Marker Inc Korea AutoRAG Answer Generation
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Question Answering, Evaluation Methodology |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Answer generation creates ground-truth answers for evaluation QA pairs by having a large language model answer generated questions using the source passages, producing the reference outputs against which a RAG system's responses are measured.
Description
After queries have been generated from corpus passages, the next step is to produce corresponding ground-truth answers. These answers serve as the gold standard for evaluating the RAG pipeline's generation component. An ideal ground-truth answer is factually correct, grounded in the source passage, and appropriately detailed.
Answer generation follows a reading comprehension paradigm: given a passage and a question, the LLM must produce an answer that is fully supported by the passage text. This is achieved by constructing a prompt that includes the passage content, the question, and instructions specifying the desired answer style. The LLM's response is then recorded as the ground-truth answer.
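The prompt assembly described above can be sketched as follows; the template wording mirrors the pattern in this article but is an illustrative assumption, not AutoRAG's verbatim prompt text.

```python
# Illustrative prompt assembly for the reading-comprehension paradigm.
# The template wording is assumed, modeled on the pattern described here.
PROMPT_TEMPLATE = "Text:\n{passage}\n\nQuestion:\n{query}\n\nAnswer:"

def build_user_prompt(passages: list[str], query: str) -> str:
    # All passages backing the question are joined into one context block.
    return PROMPT_TEMPLATE.format(passage="\n".join(passages), query=query)
```

The system prompt (which varies by variant) is sent alongside this user prompt, so the template itself stays identical across variants.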
Three main answer generation variants are supported:
- Basic (detailed) generation instructs the LLM to provide a thorough, well-explained answer based on the passage. This produces longer answers suitable for evaluating generation quality metrics such as ROUGE and semantic similarity.
- Concise generation instructs the LLM to provide a brief answer, typically a single phrase or sentence. This is useful for evaluating exact-match or F1-based metrics.
- Custom generation accepts a user-defined system prompt, enabling domain-specific answer formatting (e.g., structured clinical answers, legal citations).
The answer generation step can be applied multiple times to the same QA pair, appending additional answers to the generation_gt list. This allows building datasets with multiple reference answers per question, which is valuable for evaluation metrics that benefit from answer diversity.
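Since the three variants differ only in the system prompt, they can be sketched as prompt constants plus a factory for the custom case. The strings below paraphrase each variant's intent and are assumptions, not AutoRAG's actual prompt text.

```python
# Illustrative system prompts for the three variants (wording is assumed,
# not copied from AutoRAG).
BASIC_SYSTEM_PROMPT = (
    "Read the given text and answer the question thoroughly, "
    "explaining every relevant detail found in the text."
)
CONCISE_SYSTEM_PROMPT = (
    "Read the given text and answer the question with a single short "
    "phrase or sentence."
)

def make_custom_system_prompt(domain_instructions: str) -> str:
    # Custom generation: the caller supplies domain-specific formatting
    # rules (e.g. structured clinical answers, legal citations).
    return f"Read the given text and answer the question. {domain_instructions}"
```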
Usage
Answer generation is applied after query generation and before quality filtering. It is invoked via the QA class's batch_apply() method with an async answer generation function and an LLM instance. The retrieval_gt_contents column must be populated before calling this function.
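The invocation flow can be sketched with a runnable stand-in. The `batch_apply`, `make_answer`, and `stub_llm` names below are illustrative stand-ins for the AutoRAG API, and the stub LLM returns a canned answer so the sketch runs offline; in practice a real chat model fills that role.

```python
import asyncio

# Stand-in for the QA.batch_apply flow: an async answer-generation function
# is applied concurrently to every QA row.
async def stub_llm(messages: list[dict]) -> str:
    return "Paris"  # pretend the model answered from the passage

async def make_answer(row: dict, llm) -> dict:
    # retrieval_gt_contents must already be populated, as noted above.
    passage = "\n".join(row["retrieval_gt_contents"])
    messages = [
        {"role": "system", "content": "Answer using only the given text."},
        {"role": "user",
         "content": f"Text:\n{passage}\n\nQuestion:\n{row['query']}\n\nAnswer:"},
    ]
    row.setdefault("generation_gt", []).append(await llm(messages))
    return row

async def batch_apply(rows: list[dict], fn, llm) -> list[dict]:
    # Rows are processed concurrently, mirroring the batch/async style.
    return list(await asyncio.gather(*(fn(r, llm) for r in rows)))

rows = [{"qid": "q1",
         "query": "What is the capital of France?",
         "retrieval_gt_contents": ["Paris is the capital of France."]}]
rows = asyncio.run(batch_apply(rows, make_answer, stub_llm))
```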
Theoretical Basis
The answer generation process follows this pattern:
INPUT:  QA DataFrame with columns (qid, query, retrieval_gt, retrieval_gt_contents)
        LLM instance L
        Language lang
        System prompt S (varies by variant: basic, concise, custom)
OUTPUT: QA DataFrame with added/updated column (generation_gt: List[str])
For each row r_i:
    passage_str = join(r_i.retrieval_gt_contents)
    user_prompt = format("Text:\n{passage}\n\nQuestion:\n{query}\n\nAnswer:", passage_str, r_i.query)
    messages = [SystemMessage(S), UserMessage(user_prompt)]
    response = L.achat(messages, temperature=0.0)
    r_i.generation_gt = append(r_i.generation_gt, response.content)
Setting temperature=0.0 is a deliberate choice to ensure deterministic, factual answers rather than creative or varied ones. Ground-truth answers should be as reliable and consistent as possible.
The generation_gt accumulation pattern is noteworthy: the add_gen_gt helper function handles three cases:
if "generation_gt" not in row:
    row["generation_gt"] = [new_answer]
elif isinstance(row["generation_gt"], list):
    row["generation_gt"].append(new_answer)
elif isinstance(row["generation_gt"], str):
    row["generation_gt"] = [row["generation_gt"], new_answer]
This allows multiple invocations of answer generation to build up a list of reference answers, supporting multi-reference evaluation metrics.
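Wrapped in a function, the accumulation logic can be exercised end to end (the `add_gen_gt` signature here is a sketch matching the cases described above):

```python
def add_gen_gt(row: dict, new_answer: str) -> dict:
    # Three cases: no answers yet, an existing list, or a legacy scalar.
    if "generation_gt" not in row:
        row["generation_gt"] = [new_answer]                         # start a list
    elif isinstance(row["generation_gt"], list):
        row["generation_gt"].append(new_answer)                     # append
    elif isinstance(row["generation_gt"], str):
        row["generation_gt"] = [row["generation_gt"], new_answer]   # wrap scalar
    return row

row = {}
add_gen_gt(row, "a detailed answer")   # first invocation: basic mode
add_gen_gt(row, "a concise answer")    # second invocation: concise mode
```

After two invocations, `row["generation_gt"]` holds both reference answers, which is exactly what multi-reference metrics consume.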
Answer quality considerations:
| Consideration | Description |
|---|---|
| Groundedness | Answer must be derivable from the provided passage, not from LLM pre-training knowledge |
| Completeness | For basic mode, the answer should address all aspects of the question |
| Conciseness | For concise mode, the answer should be as brief as possible while remaining correct |
| Answerability detection | If the passage does not contain sufficient information, the LLM may produce a "don't know" response, which is handled by downstream quality filtering |
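The answerability case in the table could be flagged with a simple heuristic before downstream filtering. AutoRAG's actual quality filter may use an LLM judge; the keyword check below is only a sketch, and the marker list is an assumption.

```python
# Hypothetical heuristic for flagging "don't know" responses so that
# downstream quality filtering can drop unanswerable QA pairs.
DONT_KNOW_MARKERS = (
    "don't know",
    "do not know",
    "cannot answer",
    "not enough information",
)

def looks_unanswerable(answer: str) -> bool:
    # Case-insensitive substring match; a real filter would be more robust.
    lowered = answer.lower()
    return any(marker in lowered for marker in DONT_KNOW_MARKERS)
```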