Principle:Deepset ai Haystack Extracted Answer Schema
Overview
The extracted answer schema defines the data structure for span-based answers with provenance tracking. It captures the answer text, its precise location within the source document (character offsets), surrounding context, a confidence score, and the source Document object. This page describes the schema as a pattern: a data structure interface used throughout Haystack's extractive QA pipeline.
Domains
- NLP
- Data_Modeling
Theory
Structured Representation of Extracted Answers
The extracted answer schema provides a structured representation that captures all the information needed to understand and verify an extracted answer:
- Answer text (data): The actual text span extracted from the document. When data is None, the entry represents a "no answer" prediction, indicating the model found no confident answer in the provided documents.
- Source document (document): A reference to the full Document object from which the answer was extracted. This enables downstream components to access the complete document content, metadata, and other fields.
- Document offset (document_offset): A Span object with start and end integer fields representing character positions within the source document's content. This enables precise highlighting and verification of the answer within the original text.
- Context (context): The surrounding text that provides additional context for the extracted answer. This is useful for presenting answers with enough surrounding information for the user to evaluate relevance.
- Context offset (context_offset): A Span object indicating the position of the answer within the context string (as opposed to within the full document).
- Confidence score (score): A float representing the model's confidence in this answer. Scores are comparable across documents and sequences when using implementations that avoid per-document normalization.
- Query (query): The original question that produced this answer, maintaining the link between the question and its answer.
- Metadata (meta): An extensible dictionary for additional information such as computed page numbers or other downstream annotations.
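The fields listed above can be sketched with standard-library dataclasses. This is an illustrative stand-in, not Haystack's own definition: the reduced Document class, the field order, and the default values are assumptions made so the example runs without Haystack installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    start: int  # inclusive character position
    end: int    # exclusive character position


@dataclass
class Document:
    # Hypothetical, reduced stand-in for Haystack's Document class.
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ExtractedAnswer:
    # Field names follow the schema described above; defaults are assumptions.
    query: str
    score: float
    data: Optional[str] = None
    document: Optional[Document] = None
    context: Optional[str] = None
    document_offset: Optional[Span] = None
    context_offset: Optional[Span] = None
    meta: Dict[str, Any] = field(default_factory=dict)


doc = Document(content="Haystack is an open-source framework built by deepset.")
start = doc.content.index("deepset")

answer = ExtractedAnswer(
    query="Who built Haystack?",
    score=0.93,
    data="deepset",
    document=doc,
    document_offset=Span(start, start + len("deepset")),
    meta={"page": 1},
)

# The offsets let a consumer recover the answer text from the source document.
offset = answer.document_offset
highlighted = answer.document.content[offset.start:offset.end]
```

Because the answer carries both the Document reference and the offsets, a user interface can highlight the span without searching the text again.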
No-Answer Representation
A special "no answer" entry uses data=None to indicate the model found no confident answer. Its score represents the probability that none of the other extracted answers are correct. This allows downstream components to handle the absence of an answer through the same data structure, avoiding special-case logic.
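As a rough sketch of how a no-answer entry can flow through the same code path as regular answers, the example below appends one to a candidate list. The reduced Answer class and the helper names with_no_answer and best_text are hypothetical, and computing the score as the product of (1 - score) is an assumption that treats the individual answer scores as independent probabilities.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    # Hypothetical, reduced stand-in carrying only the fields needed here.
    query: str
    score: float
    data: Optional[str] = None


def with_no_answer(query: str, answers: List[Answer]) -> List[Answer]:
    """Append a no-answer entry (data=None) to a list of extracted answers."""
    # Assumption: scores are independent probabilities, so the chance that
    # every extracted answer is wrong is the product of (1 - score).
    no_answer_score = math.prod(1.0 - a.score for a in answers)
    return answers + [Answer(query=query, score=no_answer_score, data=None)]


def best_text(answers: List[Answer], fallback: str = "No answer found.") -> str:
    """Return the top-scoring answer text, or a fallback for the no-answer entry."""
    top = max(answers, key=lambda a: a.score)
    return fallback if top.data is None else top.data


query = "What is the capital of France?"
confident = with_no_answer(query, [Answer(query, 0.5, "Paris"), Answer(query, 0.2, "Lyon")])
uncertain = with_no_answer(query, [Answer(query, 0.1, "Lyon")])
```

Ranking treats every entry the same way; the only check for data being None happens at the point where the text is presented.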
Span Type
The schema defines a nested Span type with two integer fields:
- start: The inclusive start character position.
- end: The exclusive end character position.
This Span type is reused for both document_offset and context_offset, providing a consistent interface for positional information.
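The half-open convention (inclusive start, exclusive end) matches Python slicing, so a Span can be applied directly to a string. The sketch below also shows how a context offset relates to a document offset; the helper names slice_span and context_window and the fixed character margin are hypothetical choices for illustration, not how Haystack builds its context.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    start: int  # inclusive
    end: int    # exclusive


def slice_span(text: str, span: Span) -> str:
    """Return the characters covered by a half-open span."""
    return text[span.start:span.end]


def context_window(content: str, doc_span: Span, margin: int) -> Tuple[str, Span]:
    """Cut a window around the answer and re-express the span relative to it."""
    ctx_start = max(0, doc_span.start - margin)
    ctx_end = min(len(content), doc_span.end + margin)
    context = content[ctx_start:ctx_end]
    # Shift the document-level span by the window's starting position.
    ctx_span = Span(doc_span.start - ctx_start, doc_span.end - ctx_start)
    return context, ctx_span


content = "The capital of France is Paris, a city on the Seine."
start = content.index("Paris")
document_offset = Span(start, start + len("Paris"))

context, context_offset = context_window(content, document_offset, margin=10)
```

The two spans point at the same characters but use different origins: document_offset indexes into the full content, while context_offset indexes into the shorter context string.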
Protocol Conformance
ExtractedAnswer conforms to the Answer protocol, which requires the data, query, and meta fields and the to_dict() / from_dict() methods. This allows generic handling of different answer types (extracted, generated) through a common interface.
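The protocol idea can be sketched with typing.Protocol. The classes below are simplified stand-ins written for this example (their field sets and the summarize helper are assumptions); they only illustrate how one function can accept any answer type that exposes the required members.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional, Protocol


class Answer(Protocol):
    # Members named in the text above: data, query, meta, to_dict, from_dict.
    data: Any
    query: str
    meta: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]: ...

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Answer": ...


@dataclass
class ExtractedAnswer:
    # Simplified stand-in: offsets, context, and document are omitted.
    query: str
    score: float
    data: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExtractedAnswer":
        return cls(**data)


@dataclass
class GeneratedAnswer:
    # Simplified stand-in for a generated (non-span) answer type.
    data: str
    query: str
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "GeneratedAnswer":
        return cls(**data)


def summarize(answers: List[Answer]) -> List[str]:
    """Works for any answer type because it touches only protocol members."""
    return [f"{a.query} -> {a.data}" for a in answers]


lines = summarize([
    ExtractedAnswer(query="Who built Haystack?", score=0.93, data="deepset"),
    GeneratedAnswer(data="Haystack was built by deepset.", query="Who built Haystack?"),
])
```

Neither class inherits from Answer; conformance is structural, so a static type checker accepts both wherever an Answer is expected.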
Design Rationale
- Provenance tracking: By linking each answer to its source Document and recording character offsets, the schema enables answer verification and highlighting in user interfaces.
- Serialization support: The to_dict() and from_dict() methods enable pipeline serialization, caching, and debugging.
- Uniform interface: The no-answer entry uses the same data structure as regular answers, simplifying downstream processing.
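A serialization round trip might look like the following sketch. The flat dictionary layout here is an assumption chosen for brevity and is not necessarily the format Haystack's own to_dict() produces; the point is only that nested Span values and no-answer entries survive a trip through JSON.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    start: int
    end: int


@dataclass
class ExtractedAnswer:
    # Simplified stand-in with one nested Span field.
    query: str
    score: float
    data: Optional[str] = None
    document_offset: Optional[Span] = None
    meta: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # asdict converts nested dataclasses (the Span) into plain dicts.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExtractedAnswer":
        fields = dict(data)  # copy so the caller's dict is not mutated
        offset = fields.get("document_offset")
        if offset is not None:
            fields["document_offset"] = Span(**offset)
        return cls(**fields)


original = ExtractedAnswer(
    query="Who built Haystack?",
    score=0.93,
    data="deepset",
    document_offset=Span(46, 53),
    meta={"page": 1},
)
no_answer = ExtractedAnswer(query="Who built Haystack?", score=0.07)

# Round trip through JSON, e.g. for caching or debugging output.
restored = ExtractedAnswer.from_dict(json.loads(json.dumps(original.to_dict())))
restored_no_answer = ExtractedAnswer.from_dict(json.loads(json.dumps(no_answer.to_dict())))
```

The no-answer entry needs no special handling during serialization, which is the uniform-interface point in practice.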
Related Pages
Implementation:Deepset_ai_Haystack_ExtractedAnswer_Dataclass