Principle: Marker Inc Korea AutoRAG Answer Generation
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Question Answering, Evaluation Methodology |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Answer generation creates ground-truth answers for evaluation QA pairs by having a large language model answer generated questions using the source passages, producing the reference outputs against which a RAG system's responses are measured.
Description
After queries have been generated from corpus passages, the next step is to produce corresponding ground-truth answers. These answers serve as the gold standard for evaluating the RAG pipeline's generation component. An ideal ground-truth answer is factually correct, grounded in the source passage, and appropriately detailed.
Answer generation follows a reading comprehension paradigm: given a passage and a question, the LLM must produce an answer that is fully supported by the passage text. This is achieved by constructing a prompt that includes the passage content, the question, and instructions specifying the desired answer style. The LLM's response is then recorded as the ground-truth answer.
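The prompt assembly described above can be sketched as follows; the template wording mirrors the pattern in this article but is an illustrative assumption, not AutoRAG's verbatim prompt text.

```python
# Illustrative prompt assembly for the reading-comprehension paradigm.
# The template wording is assumed, modeled on the pattern described here.
PROMPT_TEMPLATE = "Text:\n{passage}\n\nQuestion:\n{query}\n\nAnswer:"

def build_user_prompt(passages: list[str], query: str) -> str:
    # All passages backing the question are joined into one context block.
    return PROMPT_TEMPLATE.format(passage="\n".join(passages), query=query)
```

The system prompt (which varies by variant) is sent alongside this user prompt, so the template itself stays identical across variants.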
Three main answer generation variants are supported:
- Basic (detailed) generation instructs the LLM to provide a thorough, well-explained answer based on the passage. This produces longer answers suitable for evaluating generation quality metrics such as ROUGE and semantic similarity.
- Concise generation instructs the LLM to provide a brief answer, typically a single phrase or sentence. This is useful for evaluating exact-match or F1-based metrics.
- Custom generation accepts a user-defined system prompt, enabling domain-specific answer formatting (e.g., structured clinical answers, legal citations).
The answer generation step can be applied multiple times to the same QA pair, appending additional answers to the generation_gt list. This allows building datasets with multiple reference answers per question, which is valuable for evaluation metrics that benefit from answer diversity.
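Since the three variants differ only in the system prompt, they can be sketched as prompt constants plus a factory for the custom case. The strings below paraphrase each variant's intent and are assumptions, not AutoRAG's actual prompt text.

```python
# Illustrative system prompts for the three variants (wording is assumed,
# not copied from AutoRAG).
BASIC_SYSTEM_PROMPT = (
    "Read the given text and answer the question thoroughly, "
    "explaining every relevant detail found in the text."
)
CONCISE_SYSTEM_PROMPT = (
    "Read the given text and answer the question with a single short "
    "phrase or sentence."
)

def make_custom_system_prompt(domain_instructions: str) -> str:
    # Custom generation: the caller supplies domain-specific formatting
    # rules (e.g. structured clinical answers, legal citations).
    return f"Read the given text and answer the question. {domain_instructions}"
```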
Usage
Answer generation is applied after query generation and before quality filtering. It is invoked via the QA class's batch_apply() method with an async answer generation function and an LLM instance. The retrieval_gt_contents column must be populated before calling this function.
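The invocation flow can be sketched with a runnable stand-in. The `batch_apply`, `make_answer`, and `stub_llm` names below are illustrative stand-ins for the AutoRAG API, and the stub LLM returns a canned answer so the sketch runs offline; in practice a real chat model fills that role.

```python
import asyncio

# Stand-in for the QA.batch_apply flow: an async answer-generation function
# is applied concurrently to every QA row.
async def stub_llm(messages: list[dict]) -> str:
    return "Paris"  # pretend the model answered from the passage

async def make_answer(row: dict, llm) -> dict:
    # retrieval_gt_contents must already be populated, as noted above.
    passage = "\n".join(row["retrieval_gt_contents"])
    messages = [
        {"role": "system", "content": "Answer using only the given text."},
        {"role": "user",
         "content": f"Text:\n{passage}\n\nQuestion:\n{row['query']}\n\nAnswer:"},
    ]
    row.setdefault("generation_gt", []).append(await llm(messages))
    return row

async def batch_apply(rows: list[dict], fn, llm) -> list[dict]:
    # Rows are processed concurrently, mirroring the batch/async style.
    return list(await asyncio.gather(*(fn(r, llm) for r in rows)))

rows = [{"qid": "q1",
         "query": "What is the capital of France?",
         "retrieval_gt_contents": ["Paris is the capital of France."]}]
rows = asyncio.run(batch_apply(rows, make_answer, stub_llm))
```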
Theoretical Basis
The answer generation process follows this pattern:
INPUT:  QA DataFrame with columns (qid, query, retrieval_gt, retrieval_gt_contents)
        LLM instance L
        Language lang
        System prompt S (varies by variant: basic, concise, custom)
OUTPUT: QA DataFrame with added/updated column (generation_gt: List[str])
For each row r_i:
    passage_str = join(r_i.retrieval_gt_contents)
    user_prompt = format("Text:\n{passage}\n\nQuestion:\n{query}\n\nAnswer:", passage_str, r_i.query)
    messages = [SystemMessage(S), UserMessage(user_prompt)]
    response = L.achat(messages, temperature=0.0)
    r_i.generation_gt = append(r_i.generation_gt, response.content)
Setting temperature=0.0 is a deliberate choice to ensure deterministic, factual answers rather than creative or varied ones. Ground-truth answers should be as reliable and consistent as possible.
The generation_gt accumulation pattern is noteworthy: the add_gen_gt helper function handles three cases:
if "generation_gt" not in row:
    row["generation_gt"] = [new_answer]
elif isinstance(row["generation_gt"], list):
    row["generation_gt"].append(new_answer)
elif isinstance(row["generation_gt"], str):
    row["generation_gt"] = [row["generation_gt"], new_answer]
This allows multiple invocations of answer generation to build up a list of reference answers, supporting multi-reference evaluation metrics.
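Wrapped in a function, the accumulation logic can be exercised end to end (the `add_gen_gt` signature here is a sketch matching the cases described above):

```python
def add_gen_gt(row: dict, new_answer: str) -> dict:
    # Three cases: no answers yet, an existing list, or a legacy scalar.
    if "generation_gt" not in row:
        row["generation_gt"] = [new_answer]                         # start a list
    elif isinstance(row["generation_gt"], list):
        row["generation_gt"].append(new_answer)                     # append
    elif isinstance(row["generation_gt"], str):
        row["generation_gt"] = [row["generation_gt"], new_answer]   # wrap scalar
    return row

row = {}
add_gen_gt(row, "a detailed answer")   # first invocation: basic mode
add_gen_gt(row, "a concise answer")    # second invocation: concise mode
```

After two invocations, `row["generation_gt"]` holds both reference answers, which is exactly what multi-reference metrics consume.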
Answer quality considerations:
| Consideration | Description |
|---|---|
| Groundedness | Answer must be derivable from the provided passage, not from LLM pre-training knowledge |
| Completeness | For basic mode, the answer should address all aspects of the question |
| Conciseness | For concise mode, the answer should be as brief as possible while remaining correct |
| Answerability detection | If the passage does not contain sufficient information, the LLM may produce a "don't know" response, which is handled by downstream quality filtering |
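The answerability case in the table could be flagged with a simple heuristic before downstream filtering. AutoRAG's actual quality filter may use an LLM judge; the keyword check below is only a sketch, and the marker list is an assumption.

```python
# Hypothetical heuristic for flagging "don't know" responses so that
# downstream quality filtering can drop unanswerable QA pairs.
DONT_KNOW_MARKERS = (
    "don't know",
    "do not know",
    "cannot answer",
    "not enough information",
)

def looks_unanswerable(answer: str) -> bool:
    # Case-insensitive substring match; a real filter would be more robust.
    lowered = answer.lower()
    return any(marker in lowered for marker in DONT_KNOW_MARKERS)
```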