Implementation:Deepset ai Haystack AnswerBuilder

Overview

AnswerBuilder is a Haystack component that converts a query and Generator replies into structured GeneratedAnswer objects. It supports regex-based answer extraction, document reference parsing with citation tracking, and works with both plain string replies and ChatMessage objects from chat generators.

Source Location

File: haystack/components/builders/answer_builder.py (Lines 16-257)
Class: AnswerBuilder
Component decorator: @component

Import

from haystack.components.builders import AnswerBuilder

Dependencies

haystack.dataclasses: Provides GeneratedAnswer, Document, and ChatMessage.
re (standard library): Used for regex-based answer extraction and reference parsing.

Constructor

def __init__(
    self,
    pattern: str | None = None,
    reference_pattern: str | None = None,
    last_message_only: bool = False,
    *,
    return_only_referenced_documents: bool = True,
)

Parameters

pattern (str | None): Regular expression pattern to extract the answer text from the generator output. If not specified, the entire response is used as the answer. The pattern may contain at most one capture group:
- No capture group: The whole regex match is used as the answer. Example: [^\n]+$ extracts the last line.
- One capture group: The captured group text is used as the answer. Example: Answer: (.*) extracts everything after "Answer: ".
- Multiple capture groups: Rejected with a ValueError.
reference_pattern (str | None): Regular expression pattern for parsing document references from the generated text. References must be 1-based indices. Example: \[(\d+)\] extracts "1" from "answer[1]". When provided, documents receive a "referenced" metadata field.
last_message_only (bool): If True, only the last reply is processed. If False (default), all replies are processed.
return_only_referenced_documents (bool): When used with reference_pattern, if True (default), only documents actually referenced in the reply are included. If False, all documents are included with reference annotations. Has no effect when reference_pattern is not provided.

Run Method

@component.output_types(answers=list[GeneratedAnswer])
def run(
    self,
    query: str,
    replies: list[str] | list[ChatMessage],
    meta: list[dict[str, Any]] | None = None,
    documents: list[Document] | None = None,
    pattern: str | None = None,
    reference_pattern: str | None = None,
) -> dict:  # Returns {"answers": list[GeneratedAnswer]}

Parameters

query (str): The input query that was used as the generator prompt.
replies (list[str] | list[ChatMessage]): The generator output. Can be plain strings (from non-chat generators) or ChatMessage objects (from chat generators).
meta (list[dict] | None): Optional metadata from the generator, one dictionary per reply. Must match the length of replies if provided.
documents (list[Document] | None): Optional source documents used as generator context. When provided, they are attached to the GeneratedAnswer objects with provenance annotations.
pattern (str | None): Optional runtime override for the answer extraction pattern.
reference_pattern (str | None): Optional runtime override for the reference parsing pattern.

Returns

{"answers": list[GeneratedAnswer]}: A dictionary containing structured answer objects, one per processed reply.

Behavior

Initializes default empty metadata if none is provided; validates that replies and meta lengths match.
Validates any runtime pattern for capture group count.
Selects the effective pattern and reference_pattern: a value passed to run() takes precedence over the one set in the constructor.
If last_message_only is True, restricts processing to the last reply and its metadata.
For each reply:
- Extracts text content: uses .text for ChatMessage objects, str() for strings.
- Extracts metadata: merges ChatMessage.meta (if applicable) with the provided meta dictionary, and adds all_messages containing the full replies list.
- Document reference processing (if documents are provided):
  - If reference_pattern is set, extracts 1-based document indices from the reply text.
  - Each document receives a "source_index" metadata field (1-based position in the input list).
  - Each document receives a "referenced" boolean metadata field when reference parsing is active.
  - Out-of-range indices are logged as warnings and skipped.
  - If return_only_referenced_documents is True, only referenced documents are included.
- Answer text extraction: Applies the regex pattern to extract the answer string. If the pattern does not match, an empty string is returned.
- Constructs a GeneratedAnswer with the extracted text, query, processed documents, and merged metadata.
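
The input preparation and document annotation steps above can be sketched with the standard library alone. This is an illustrative approximation, not the Haystack source: Doc is a hypothetical stand-in for haystack.Document, and select_replies and annotate_documents are hypothetical helper names introduced only for this sketch.

```python
import logging
from dataclasses import dataclass, field, replace
from typing import Any

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document (content and meta only)."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def select_replies(replies, meta=None, last_message_only=False):
    """Default and validate meta, then optionally keep only the last reply."""
    meta = meta if meta is not None else [{} for _ in replies]
    if len(replies) != len(meta):
        raise ValueError(f"Got {len(replies)} replies but {len(meta)} meta entries.")
    if last_message_only:
        return replies[-1:], meta[-1:]
    return replies, meta


def annotate_documents(docs, referenced_idxs, reference_parsing, only_referenced=True):
    """Attach source_index (1-based) and, when parsing is active, referenced.

    referenced_idxs holds 0-based indices parsed from the reply text.
    """
    if reference_parsing and only_referenced:
        candidates = sorted(referenced_idxs)
    else:
        candidates = range(len(docs))
    annotated = []
    for idx in candidates:
        if not 0 <= idx < len(docs):
            # Out-of-range references are reported and skipped.
            logger.warning("Document index %s is out of range; skipping.", idx + 1)
            continue
        extra = {"source_index": idx + 1}
        if reference_parsing:
            extra["referenced"] = idx in referenced_idxs
        # Copy the document so the caller's input list stays untouched.
        annotated.append(replace(docs[idx], meta={**docs[idx].meta, **extra}))
    return annotated
```

Copying each document instead of mutating it is a design choice of this sketch; it keeps the input list reusable across replies.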

Static Helper Methods

_extract_answer_string

@staticmethod
def _extract_answer_string(reply: str, pattern: str | None = None) -> str

Extracts the answer from the reply using the regex pattern. Returns the full reply if no pattern is specified. If the pattern matches, returns the captured group when the pattern has one, otherwise the whole match. Returns an empty string if the pattern does not match.

_extract_reference_idxs

@staticmethod
def _extract_reference_idxs(reply: str, reference_pattern: str) -> set[int]

Extracts all document reference indices from the reply text. Converts 1-based references to 0-based indices for internal processing.
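
The 1-based to 0-based conversion can be illustrated as follows. This is a sketch under the assumption that reference_pattern contains exactly one capture group holding the digits, as in the \[(\d+)\] example; it is not the library implementation.

```python
import re


def extract_reference_idxs(reply: str, reference_pattern: str) -> set[int]:
    """Collect every referenced index in reply as a 0-based integer."""
    # With a single capture group, re.findall returns the captured strings.
    return {int(ref) - 1 for ref in re.findall(reference_pattern, reply)}


# Repeated citations collapse into one entry because the result is a set.
print(extract_reference_idxs("Paris [2], Rome [3], and Paris again [2].", r"\[(\d+)\]"))
```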

_check_num_groups_in_regex

@staticmethod
def _check_num_groups_in_regex(pattern: str)

Validates that a regex pattern contains at most one capture group. Raises ValueError for patterns with multiple groups.
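
One way to perform this check is to compile the pattern and read the groups attribute of the compiled object, which counts capturing groups only. The sketch below assumes that approach; the error message wording is illustrative.

```python
import re


def check_num_groups_in_regex(pattern: str) -> None:
    """Raise ValueError if pattern has more than one capturing group."""
    num_groups = re.compile(pattern).groups
    if num_groups > 1:
        raise ValueError(
            f"Pattern '{pattern}' contains {num_groups} capture groups; "
            "at most one is allowed."
        )


check_num_groups_in_regex(r"Answer: (.*)")  # one group: accepted
check_num_groups_in_regex(r"(?:a|b)+")      # non-capturing group: not counted
```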

Usage Examples

Basic Answer Extraction

from haystack.components.builders import AnswerBuilder

builder = AnswerBuilder(pattern="Answer: (.*)")
result = builder.run(
    query="What's the answer?",
    replies=["This is an argument. Answer: This is the answer."],
)
# result["answers"][0].data == "This is the answer."

With Documents and Reference Parsing

from haystack import Document
from haystack.components.builders import AnswerBuilder

replies = ["The capital of France is Paris [2]."]
docs = [
    Document(content="Berlin is the capital of Germany."),
    Document(content="Paris is the capital of France."),
    Document(content="Rome is the capital of Italy."),
]

builder = AnswerBuilder(reference_pattern=r"\[(\d+)\]", return_only_referenced_documents=False)
result = builder.run(query="What is the capital of France?", replies=replies, documents=docs)

answer = result["answers"][0]
print(f"Answer: {answer.data}")
# Answer: The capital of France is Paris [2].

for doc in answer.documents:
    if doc.meta["referenced"]:
        print(f"[{doc.meta['source_index']}] {doc.content}")
# [2] Paris is the capital of France.

Related Pages

Principle:Deepset_ai_Haystack_Answer_Construction
