Implementation:Deepset ai Haystack AnswerBuilder
Overview
AnswerBuilder is a Haystack component that converts a query and Generator replies into structured GeneratedAnswer objects. It supports regex-based answer extraction, document reference parsing with citation tracking, and works with both plain string replies and ChatMessage objects from chat generators.
Source Location
- File: haystack/components/builders/answer_builder.py (Lines 16-257)
- Class: AnswerBuilder
- Component decorator: @component
Import
from haystack.components.builders import AnswerBuilder
Dependencies
- haystack.dataclasses: Provides GeneratedAnswer, Document, and ChatMessage.
- re (standard library): Used for regex-based answer extraction and reference parsing.
Constructor
def __init__(
    self,
    pattern: str | None = None,
    reference_pattern: str | None = None,
    last_message_only: bool = False,
    *,
    return_only_referenced_documents: bool = True,
)
Parameters
- pattern (str | None): Regular expression pattern to extract the answer text from the generator output. If not specified, the entire response is used as the answer. The pattern may contain at most one capture group:
  - No capture group: The whole regex match is used as the answer. Example: [^\n]+$ extracts the last line.
  - One capture group: The captured group text is used as the answer. Example: Answer: (.*) extracts everything after "Answer: ".
  - Multiple capture groups: Rejected with a ValueError.
- reference_pattern (str | None): Regular expression pattern for parsing document references from the generated text. References must be 1-based indices. Example: \[(\d+)\] extracts "1" from "answer[1]". When provided, documents receive a "referenced" metadata field.
- last_message_only (bool): If True, only the last reply is processed. If False (default), all replies are processed.
- return_only_referenced_documents (bool): When used with reference_pattern, if True (default), only documents actually referenced in the reply are included. If False, all documents are included with reference annotations. Has no effect when reference_pattern is not provided.
Run Method
@component.output_types(answers=list[GeneratedAnswer])
def run(
    self,
    query: str,
    replies: list[str] | list[ChatMessage],
    meta: list[dict[str, Any]] | None = None,
    documents: list[Document] | None = None,
    pattern: str | None = None,
    reference_pattern: str | None = None,
) -> dict:  # Returns {"answers": list[GeneratedAnswer]}
Parameters
- query (str): The input query that was used as the generator prompt.
- replies (list[str] | list[ChatMessage]): The generator output. Can be plain strings (from non-chat generators) or ChatMessage objects (from chat generators).
- meta (list[dict] | None): Optional metadata from the generator, one dictionary per reply. Must match the length of replies if provided.
- documents (list[Document] | None): Optional source documents used as generator context. When provided, they are attached to the GeneratedAnswer objects with provenance annotations.
- pattern (str | None): Optional runtime override for the answer extraction pattern.
- reference_pattern (str | None): Optional runtime override for the reference parsing pattern.
Returns
{"answers": list[GeneratedAnswer]}: A dictionary containing structured answer objects, one per processed reply.
Behavior
- Initializes default empty metadata if none is provided; validates that replies and meta lengths match.
- Validates any runtime pattern for capture group count.
- Selects the pattern and reference_pattern (runtime > init).
- If last_message_only is True, restricts processing to the last reply and its metadata.
- For each reply:
  - Extracts text content: uses .text for ChatMessage objects, str() for strings.
  - Extracts metadata: merges ChatMessage.meta (if applicable) with the provided meta dictionary, and adds all_messages containing the full replies list.
  - Document reference processing (if documents are provided):
    - If reference_pattern is set, extracts 1-based document indices from the reply text.
    - Each document receives a "source_index" metadata field (1-based position in the input list).
    - Each document receives a "referenced" boolean metadata field when reference parsing is active.
    - Out-of-range indices are logged as warnings and skipped.
    - If return_only_referenced_documents is True, only referenced documents are included.
  - Answer text extraction: Applies the regex pattern to extract the answer string. If the pattern does not match, an empty string is returned.
  - Constructs a GeneratedAnswer with the extracted text, query, processed documents, and merged metadata.
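The document reference steps above can be illustrated without Haystack, using plain dictionaries as stand-ins for Document objects. This is a minimal sketch that follows the description in this section, not the library source; the function name annotate_documents and its parameters are hypothetical, and the indices are assumed to have already been parsed from the reply as 0-based values.

```python
import logging

logger = logging.getLogger(__name__)


def annotate_documents(documents, referenced_idxs, reference_parsing_active, only_referenced=True):
    """Hypothetical stand-in for the per-reply document processing described above.

    documents: list of dicts with a "content" key and an optional "meta" key.
    referenced_idxs: set of 0-based indices already parsed from the reply.
    """
    # Out-of-range references are logged as warnings and ignored.
    valid_idxs = set()
    for idx in referenced_idxs:
        if 0 <= idx < len(documents):
            valid_idxs.add(idx)
        else:
            logger.warning("Document index %s is out of range; skipping.", idx + 1)

    result = []
    for idx, doc in enumerate(documents):
        meta = dict(doc.get("meta", {}))  # copy so the input is not mutated
        meta["source_index"] = idx + 1  # 1-based position in the input list
        if reference_parsing_active:
            meta["referenced"] = idx in valid_idxs
            if only_referenced and idx not in valid_idxs:
                continue  # drop documents the reply did not cite
        result.append({"content": doc["content"], "meta": meta})
    return result


docs = [{"content": "Berlin"}, {"content": "Paris"}, {"content": "Rome"}]

# The reply cited [2] and an invalid [7]; as 0-based indices that is {1, 6}.
kept = annotate_documents(docs, {1, 6}, reference_parsing_active=True)
everything = annotate_documents(docs, {1, 6}, reference_parsing_active=True, only_referenced=False)
```

In this sketch, kept holds only the cited document, while everything holds all three documents with a "referenced" flag on each.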
Static Helper Methods
_extract_answer_string
@staticmethod
def _extract_answer_string(reply: str, pattern: str | None = None) -> str
Extracts the answer from the reply using the regex pattern. Returns the full reply if no pattern is specified, the capture group if present, or an empty string if no match is found.
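The three outcomes described above (no pattern, a match, no match) can be sketched with the standard re module. This is an illustration based on the description here, not a copy of the library code, and the standalone function name extract_answer_string is an assumption.

```python
import re


def extract_answer_string(reply, pattern=None):
    """Sketch of the extraction rules described above (not the library source)."""
    if pattern is None:
        return reply  # no pattern: the whole reply is the answer
    match = re.search(pattern, reply)
    if match is None:
        return ""  # a pattern was given but it did not match
    # With a capture group, return group 1; otherwise return the whole match.
    return match.group(1) if match.re.groups else match.group(0)


reply = "This is an argument.\nAnswer: This is the answer."
whole = extract_answer_string(reply)
last_line = extract_answer_string(reply, r"[^\n]+$")
captured = extract_answer_string(reply, r"Answer: (.*)")
missing = extract_answer_string(reply, r"Result: (.*)")
```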
_extract_reference_idxs
@staticmethod
def _extract_reference_idxs(reply: str, reference_pattern: str) -> set[int]
Extracts all document reference indices from the reply text. Converts 1-based references to 0-based indices for internal processing.
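The 1-based to 0-based conversion can be sketched as follows. The function name extract_reference_idxs is hypothetical, and the sketch assumes the reference pattern has exactly one capture group that matches digits.

```python
import re


def extract_reference_idxs(reply, reference_pattern):
    """Sketch: collect cited 1-based numbers and return them as 0-based indices."""
    found = re.findall(reference_pattern, reply)  # one string per citation
    return {int(number) - 1 for number in found}  # the set removes repeats


idxs = extract_reference_idxs("Paris [2] is in France [3], see [2].", r"\[(\d+)\]")
none_found = extract_reference_idxs("No citations here.", r"\[(\d+)\]")
```

Because the result is a set, a document cited several times in one reply still appears once.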
_check_num_groups_in_regex
@staticmethod
def _check_num_groups_in_regex(pattern: str)
Validates that a regex pattern contains at most one capture group. Raises ValueError for patterns with multiple groups.
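One way to perform this check is to compile the pattern and read the groups attribute of the compiled pattern, which counts capturing groups. The sketch below assumes that approach; the function name and error message are illustrative.

```python
import re


def check_num_groups_in_regex(pattern):
    """Sketch: reject patterns with more than one capture group."""
    num_groups = re.compile(pattern).groups  # counts capturing groups only
    if num_groups > 1:
        raise ValueError(
            f"Pattern '{pattern}' contains {num_groups} capture groups; at most one is allowed."
        )


check_num_groups_in_regex(r"[^\n]+$")  # zero groups: accepted
check_num_groups_in_regex(r"Answer: (.*)")  # one group: accepted
check_num_groups_in_regex(r"(?:Answer|Result): (.*)")  # (?:...) is not counted here

try:
    check_num_groups_in_regex(r"(\w+): (.*)")  # two groups
    rejected = False
except ValueError:
    rejected = True
```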
Usage Examples
Basic Answer Extraction
from haystack.components.builders import AnswerBuilder

builder = AnswerBuilder(pattern="Answer: (.*)")
result = builder.run(
    query="What's the answer?",
    replies=["This is an argument. Answer: This is the answer."],
)
# result["answers"][0].data == "This is the answer."
With Documents and Reference Parsing
from haystack import Document
from haystack.components.builders import AnswerBuilder

replies = ["The capital of France is Paris [2]."]
docs = [
    Document(content="Berlin is the capital of Germany."),
    Document(content="Paris is the capital of France."),
    Document(content="Rome is the capital of Italy."),
]

# Keep every document, but mark which ones the reply cites.
builder = AnswerBuilder(reference_pattern="\\[(\\d+)\\]", return_only_referenced_documents=False)
result = builder.run(query="What is the capital of France?", replies=replies, documents=docs)
answer = result["answers"][0]

print(f"Answer: {answer.data}")
# Answer: The capital of France is Paris [2].

for doc in answer.documents:
    if doc.meta["referenced"]:
        print(f"[{doc.meta['source_index']}] {doc.content}")
# [2] Paris is the capital of France.