Principle: deepset AI Haystack Document Data Model
Overview
The document data model defines the universal data structure for documents flowing through Haystack pipelines. It carries content, metadata, embeddings, and scores, serving as the common currency that all pipeline components consume and produce. This is a pattern document that specifies a data structure interface.
Domains
- Data_Modeling
- NLP
Theory
Unified Document Representation
The Document data model provides a single, unified representation that all Haystack pipeline components understand. Rather than having different document formats for different stages of the pipeline, every component -- retrievers, readers, rankers, writers, converters -- operates on the same Document structure.
A Document can carry several types of content:
- Text content (content): A string holding the textual content of the document. This is the primary content type used in most NLP pipelines.
- Binary data (blob): A ByteStream object for non-text data such as images, audio files, or other binary formats. This allows the same Document type to represent multimodal content.
- Metadata (meta): An arbitrary dictionary of key-value pairs for storing additional information such as source URLs, file paths, page numbers, timestamps, or any application-specific data. Must be JSON-serializable.
- Dense embedding (embedding): A list of floats representing a dense vector embedding of the document content, typically produced by an embedding model. Used for semantic similarity search.
- Sparse embedding (sparse_embedding): A SparseEmbedding object representing a sparse vector (e.g., from BM25 or SPLADE models). Used for lexical or hybrid search.
- Relevance score (score): A float assigned by retrievers or rankers to indicate the document's relevance to a query.
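To make the shape of the model concrete, here is a minimal sketch of these fields as a Python dataclass. This is an illustration of the structure described above, not Haystack's actual source; ByteStream and SparseEmbedding are simplified stand-ins for the real Haystack types.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ByteStream:
    """Simplified stand-in for Haystack's binary payload type."""
    data: bytes
    mime_type: Optional[str] = None


@dataclass
class SparseEmbedding:
    """Simplified stand-in: a sparse vector as parallel index/value lists."""
    indices: list[int]
    values: list[float]


@dataclass
class Document:
    content: Optional[str] = None               # primary text content
    blob: Optional[ByteStream] = None           # binary data (images, audio, ...)
    meta: dict[str, Any] = field(default_factory=dict)  # JSON-serializable metadata
    embedding: Optional[list[float]] = None     # dense vector for semantic search
    sparse_embedding: Optional[SparseEmbedding] = None  # sparse vector (BM25, SPLADE)
    score: Optional[float] = None               # relevance score from retriever/ranker


doc = Document(
    content="Haystack is an open-source LLM framework.",
    meta={"source": "docs", "page": 1},
)
```

Every pipeline component reads and writes this one structure, which is what makes components freely composable.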
Automatic ID Generation
Document identity is managed through automatic SHA-256 hashing:
- When no explicit ID is provided, the Document automatically generates one by computing a SHA-256 hash of its content, blob, metadata, embedding, and sparse embedding fields.
- This deterministic ID generation ensures that two Documents with identical content and metadata will always have the same ID, enabling deduplication across pipeline runs.
- The ID is generated in the __post_init__ method, meaning it is computed at object creation time.
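The deterministic ID scheme can be sketched as follows. This is a simplified illustration of the idea, assuming a stable JSON serialization of the identity-relevant fields; Haystack's actual serialization details may differ.

```python
import hashlib
import json


def generate_id(content=None, blob=None, meta=None,
                embedding=None, sparse_embedding=None) -> str:
    """Compute a deterministic SHA-256 ID from the document's fields."""
    # Serialize the fields in a stable order so identical inputs
    # always produce identical hashes.
    payload = json.dumps(
        {
            "content": content,
            "blob": blob,
            "meta": meta or {},
            "embedding": embedding,
            "sparse_embedding": sparse_embedding,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Identical content and metadata -> identical ID, enabling deduplication
# across pipeline runs.
a = generate_id(content="hello", meta={"src": "a.txt"})
b = generate_id(content="hello", meta={"src": "a.txt"})
c = generate_id(content="hello", meta={"src": "b.txt"})
```

Because the ID is a pure function of the content, a document store can detect and skip duplicates without comparing full payloads.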
Equality Semantics
Two Documents are considered equal if and only if their dictionary representations (to_dict()) are identical. This means equality considers all fields, not just the ID.
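A small sketch of this equality semantics (simplified, with only two fields) shows the consequence: two documents with identical content but different scores are not equal, because every field participates in the comparison.

```python
class Doc:
    """Minimal illustration of full-dictionary equality (not Haystack's class)."""

    def __init__(self, content, score=None):
        self.content = content
        self.score = score

    def to_dict(self):
        return {"content": self.content, "score": self.score}

    def __eq__(self, other):
        # Equality compares the complete dictionary representation,
        # not just an ID or the content.
        return isinstance(other, Doc) and self.to_dict() == other.to_dict()


same = Doc("hi") == Doc("hi")
rescored = Doc("hi", score=0.9) == Doc("hi")  # score differs
```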
Backward Compatibility
The Document class uses a custom _BackwardCompatible metaclass to handle migration from Haystack 1.x:
- Legacy fields (content_type, id_hash_keys, dataframe) are silently removed if present.
- Embeddings stored as NumPy arrays (Haystack 1.x format) are automatically converted to Python lists of floats.
- The content field is validated to be a string or None.
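The metaclass approach can be sketched like this: intercepting construction in the metaclass's __call__ lets legacy keyword arguments be cleaned up before __init__ ever sees them. This is an assumed, simplified rendering of the pattern, not Haystack's actual implementation.

```python
class _BackwardCompatible(type):
    """Metaclass sketch: scrub Haystack 1.x artifacts at construction time."""

    LEGACY_FIELDS = ("content_type", "id_hash_keys", "dataframe")

    def __call__(cls, *args, **kwargs):
        # Silently drop 1.x-era fields that no longer exist on the model.
        for legacy in cls.LEGACY_FIELDS:
            kwargs.pop(legacy, None)
        # Convert NumPy-style array embeddings (anything with .tolist())
        # into plain Python lists of floats.
        emb = kwargs.get("embedding")
        if emb is not None and hasattr(emb, "tolist"):
            kwargs["embedding"] = [float(x) for x in emb.tolist()]
        return super().__call__(*args, **kwargs)


class Document(metaclass=_BackwardCompatible):
    def __init__(self, content=None, embedding=None):
        if content is not None and not isinstance(content, str):
            raise ValueError("content must be a string or None")
        self.content = content
        self.embedding = embedding


# A 1.x-style call keeps working: the legacy kwarg is ignored, not fatal.
doc = Document(content="text", id_hash_keys=["content"])
```

Intercepting at the metaclass rather than inside __init__ keeps the migration logic out of the model's own constructor.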
Serialization
The Document supports dictionary and JSON serialization:
- to_dict(flatten=True): Converts the Document to a dictionary. With flatten=True (the default), metadata keys are promoted to top-level keys for backward compatibility with Haystack 1.x.
- from_dict(data): Reconstructs a Document from a dictionary, handling both flattened and nested metadata formats. Raises a ValueError if both formats are mixed.
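The flatten/unflatten round trip can be sketched on plain dictionaries. This is an illustrative simplification of the semantics described above, assuming a fixed set of reserved top-level field names; the real Document class works on its full field set.

```python
# Top-level field names that belong to the Document itself (simplified set);
# anything else in a flattened dict is treated as metadata.
RESERVED = {"id", "content", "score"}


def to_dict(doc: dict, flatten: bool = True) -> dict:
    """Serialize a document dict, optionally promoting meta keys to top level."""
    meta = doc.get("meta", {})
    base = {k: v for k, v in doc.items() if k != "meta"}
    if flatten:
        return {**base, **meta}  # flattened: meta keys sit beside document fields
    return {**base, "meta": meta}  # nested: meta stays under its own key


def from_dict(data: dict) -> dict:
    """Rebuild a document dict from either flattened or nested form."""
    extra = {k: v for k, v in data.items() if k not in RESERVED and k != "meta"}
    if "meta" in data and extra:
        raise ValueError("Mixing flattened and nested metadata is not allowed")
    meta = data["meta"] if "meta" in data else extra
    base = {k: v for k, v in data.items() if k in RESERVED}
    return {**base, "meta": meta}


nested = {"id": "1", "content": "x", "meta": {"url": "u"}}
flat = to_dict(nested)           # {"id": "1", "content": "x", "url": "u"}
restored = from_dict(flat)       # round-trips back to the nested form
```

The ValueError on mixed formats prevents a silently ambiguous case: if both a nested meta dict and stray top-level keys are present, it is unclear which set of keys is the metadata.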
Design Rationale
- Universal interface: A single Document type simplifies component interfaces and pipeline construction.
- Content-addressable: SHA-256-based IDs enable efficient deduplication in document stores.
- Extensible metadata: The meta dict accommodates arbitrary application-specific information without schema changes.
- Multimodal support: The blob field allows the same data model to represent text, images, audio, and other binary data.
- Backward compatibility: The metaclass approach ensures a smooth migration path from Haystack 1.x.