Principle:Deepset ai Haystack Extracted Answer Schema

Overview

The extracted answer schema defines the data structure for span-based answers with provenance tracking. It captures the answer text, its precise location within the source document (character offsets), surrounding context, a confidence score, and the source Document object. This page is a pattern document: it specifies the data structure interface used throughout Haystack's extractive QA pipeline.

Domains

NLP
Data_Modeling

Theory

Structured Representation of Extracted Answers

The extracted answer schema provides a structured representation that captures all the information needed to understand and verify an extracted answer:

Answer text (data): The actual text span extracted from the document. When data is None, the entry represents a "no answer" prediction, indicating the model found no confident answer in the provided documents.

Source document (document): A reference to the full Document object from which the answer was extracted. This enables downstream components to access the complete document content, metadata, and other fields.

Document offset (document_offset): A Span object with start and end integer fields representing character positions within the source document's content. This enables precise highlighting and verification of the answer within the original text.

Context (context): The surrounding text that provides additional context for the extracted answer. This is useful for presenting answers with enough surrounding information for the user to evaluate relevance.

Context offset (context_offset): A Span object indicating the position of the answer within the context string (as opposed to within the full document).

Confidence score (score): A float representing the model's confidence in this answer. Scores are comparable across documents and sequences when using implementations that avoid per-document normalization.

Query (query): The original question that produced this answer, maintaining the link between the question and its answer.

Metadata (meta): An extensible dictionary for additional information such as computed page numbers or other downstream annotations.
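The fields above can be sketched as plain dataclasses. This is a minimal standalone illustration, not Haystack's source: the names ExtractedAnswerSketch and the reduced Document stand-in are hypothetical, and the defaults shown are assumptions chosen so that a "no answer" entry can be built from a query and a score alone.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    start: int  # inclusive character position
    end: int    # exclusive character position


@dataclass
class Document:
    # Hypothetical stand-in for the real Document; only two fields are modeled.
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ExtractedAnswerSketch:
    # Hypothetical mirror of the schema described above.
    query: str
    score: float
    data: Optional[str] = None                # None means "no answer"
    document: Optional[Document] = None
    context: Optional[str] = None
    document_offset: Optional[Span] = None    # position within document.content
    context_offset: Optional[Span] = None     # position within context
    meta: Dict[str, Any] = field(default_factory=dict)


doc = Document(content="Berlin is the capital of Germany.", meta={"source": "example"})

answer = ExtractedAnswerSketch(
    query="What is the capital of Germany?",
    score=0.93,
    data="Berlin",
    document=doc,
    context="Berlin is the capital",
    document_offset=Span(start=0, end=6),
    context_offset=Span(start=0, end=6),
    meta={"page_number": 1},
)

# A "no answer" entry needs only the query and a score.
no_answer = ExtractedAnswerSketch(query=answer.query, score=0.07)
```

Keeping every field except query and score optional is what lets the same type carry both a fully located answer and a "no answer" prediction.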

No-Answer Representation

A special "no answer" entry uses data=None to indicate the model found no confident answer. Its score represents the probability that none of the other extracted answers are correct. This allows downstream components to handle the absence of an answer through the same data structure, avoiding special-case logic.
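One way to obtain such a score, shown here as an assumption rather than a statement about any specific reader, is to treat the candidate scores as independent probabilities and multiply their complements. The helper names no_answer_score and with_no_answer are hypothetical, and answers are reduced to (data, score) pairs to keep the sketch short.

```python
import math
from typing import List, Optional, Tuple

# (data, score) pairs standing in for full answer objects.
Answer = Tuple[Optional[str], float]


def no_answer_score(scores: List[float]) -> float:
    """Probability that every candidate is wrong, ASSUMING independent scores."""
    return math.prod(1.0 - s for s in scores)


def with_no_answer(candidates: List[Answer]) -> List[Answer]:
    """Append a data=None entry and rank it alongside the regular answers."""
    entry: Answer = (None, no_answer_score([score for _, score in candidates]))
    return sorted(candidates + [entry], key=lambda a: a[1], reverse=True)


candidates: List[Answer] = [("Berlin", 0.5), ("Munich", 0.2)]
ranked = with_no_answer(candidates)

# Downstream code walks one list; the only branch is on data being None.
labels = [data if data is not None else "<no answer>" for data, _ in ranked]
```

Because the "no answer" entry is sorted with the others, a consumer can take the top entry and check whether its data is None instead of consulting a separate flag.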

Span Type

The schema defines a nested Span type with two integer fields:

start: The inclusive start character position.
end: The exclusive end character position.

This Span type is reused for both document_offset and context_offset, providing a consistent interface for positional information.
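Because start is inclusive and end is exclusive, a Span maps directly onto Python slice notation. The sketch below uses a hypothetical standalone Span and a hypothetical highlight helper to show how document_offset and context_offset locate the same answer in two different strings.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int  # inclusive
    end: int    # exclusive


def highlight(text: str, span: Span) -> str:
    """Wrap the spanned characters in brackets (hypothetical helper)."""
    return text[:span.start] + "[" + text[span.start:span.end] + "]" + text[span.end:]


content = "Haystack is an open-source framework built by deepset."
context = "framework built by deepset."
data = "deepset"

# Offsets are computed here with str.index purely for illustration.
doc_start = content.index(data)
document_offset = Span(doc_start, doc_start + len(data))

ctx_start = context.index(data)
context_offset = Span(ctx_start, ctx_start + len(data))

# Both spans recover the same answer text from their respective strings.
from_document = content[document_offset.start:document_offset.end]
from_context = context[context_offset.start:context_offset.end]

highlighted = highlight(content, document_offset)
```

With this convention, end - start equals the length of the answer text, and the two offsets generally differ whenever the context is a proper substring of the document.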

Protocol Conformance

ExtractedAnswer conforms to the Answer protocol, which requires the data, query, and meta fields together with the to_dict() and from_dict() methods. This allows different answer types (extracted, generated) to be handled generically through a common interface.
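The idea of protocol conformance can be illustrated with typing.Protocol. Everything named below (AnswerLike, ExtractedSketch, GeneratedSketch, summarize) is hypothetical, and the flat dictionary produced by to_dict() is a simplification that is not meant to match the layout Haystack actually serializes.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional, Protocol


class AnswerLike(Protocol):
    """Hypothetical structural interface with the members named above."""
    data: Any
    query: str
    meta: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "AnswerLike": ...


@dataclass
class ExtractedSketch:
    query: str
    score: float
    data: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)  # simplified flat layout, an assumption

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExtractedSketch":
        return cls(**data)


@dataclass
class GeneratedSketch:
    data: str
    query: str
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "GeneratedSketch":
        return cls(**data)


def summarize(answer: AnswerLike) -> str:
    """Works for any conforming answer type, extracted or generated."""
    text = answer.data if answer.data is not None else "<no answer>"
    return f"{answer.query} -> {text}"


original = ExtractedSketch(query="Who built Haystack?", score=0.9, data="deepset")
restored = ExtractedSketch.from_dict(original.to_dict())
```

Neither sketch class inherits from AnswerLike; they conform structurally, which is why summarize can accept both without knowing which kind of answer it was given.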

Design Rationale

Provenance tracking: By linking each answer to its source Document and recording character offsets, the schema enables answer verification and highlighting in user interfaces.
Serialization support: The to_dict() and from_dict() methods enable pipeline serialization, caching, and debugging.
Uniform interface: The no-answer entry uses the same data structure as regular answers, simplifying downstream processing.
