Implementation:Deepset ai Haystack Document Dataclass
Overview
Document is the core data class in Haystack representing a unit of data flowing through pipelines. It can contain text, binary data, metadata, embeddings, and scores. It uses a _BackwardCompatible metaclass to handle migration from Haystack 1.x. This is a pattern document describing a data structure interface.
Source Location
- File: haystack/dataclasses/document.py, lines 46-170+
- Dataclass: Document (with _BackwardCompatible metaclass)
- Decorator: @dataclass
- Metaclass: _BackwardCompatible
Import
from haystack import Document
or equivalently:
from haystack.dataclasses import Document
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| id | str | "" (auto-generated SHA-256) | Unique identifier; auto-generated from a content hash if not set |
| content | str \| None | None | Text content of the document |
| blob | ByteStream \| None | None | Binary data (images, audio, etc.) |
| meta | dict[str, Any] | {} | Arbitrary JSON-serializable metadata |
| score | float \| None | None | Relevance score assigned by retrievers or rankers |
| embedding | list[float] \| None | None | Dense vector representation of the document |
| sparse_embedding | SparseEmbedding \| None | None | Sparse vector representation (e.g., BM25, SPLADE) |
Full Dataclass Definition
@dataclass
class Document(metaclass=_BackwardCompatible):
    id: str = field(default="")
    content: str | None = field(default=None)
    blob: ByteStream | None = field(default=None)
    meta: dict[str, Any] = field(default_factory=dict)
    score: float | None = field(default=None)
    embedding: list[float] | None = field(default=None)
    sparse_embedding: SparseEmbedding | None = field(default=None)
ID Generation
The __post_init__ method automatically generates an ID if none is provided:
def __post_init__(self):
    # Keep an explicitly passed ID; otherwise derive one from the fields below.
    self.id = self.id or self._create_id()

def _create_id(self) -> str:
    text = self.content or None
    blob = self.blob.data if self.blob is not None else None
    mime_type = self.blob.mime_type if self.blob is not None else None
    meta = self.meta or {}
    embedding = self.embedding if self.embedding is not None else None
    sparse_embedding = self.sparse_embedding.to_dict() if self.sparse_embedding is not None else ""
    # The literal {None} fills the slot of the removed legacy dataframe field.
    data = f"{text}{None}{blob!r}{mime_type}{meta}{embedding}{sparse_embedding}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()
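The ID is the SHA-256 hex digest of a string that concatenates the content, blob data, blob MIME type, metadata, embedding, and sparse embedding. The same inputs always produce the same ID, which makes it usable for deduplication. The score is not part of the hash, while meta is, so two documents with identical content but different metadata get different IDs.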
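The determinism described above can be illustrated with the standard library alone. This is a minimal sketch, not Haystack's code: `sketch_id` is a hypothetical helper that hashes only content and metadata, which is enough to show that equal inputs give equal IDs and that a metadata change alters the ID.

```python
import hashlib
from typing import Any, Optional


def sketch_id(content: Optional[str], meta: Optional[dict[str, Any]] = None) -> str:
    """Hypothetical helper: hash a string built from content and meta only."""
    text = content or None
    meta_part = meta or {}
    data = f"{text}{meta_part}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


first = sketch_id("Python is a popular programming language", {"source": "wikipedia"})
second = sketch_id("Python is a popular programming language", {"source": "wikipedia"})
other_meta = sketch_id("Python is a popular programming language", {"source": "blog"})

print(first == second)      # True: same inputs, same ID
print(first == other_meta)  # False: different meta, different ID
print(len(first))           # 64: length of a SHA-256 hex digest
```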
Methods
to_dict(flatten=True)
def to_dict(self, flatten: bool = True) -> dict[str, Any]:
Converts the Document to a dictionary:
- Converts blob and sparse_embedding fields using their respective to_dict() methods.
- With flatten=True (default): metadata keys are promoted to top-level dictionary keys for backward compatibility with Haystack 1.x.
- With flatten=False: metadata remains nested under the "meta" key.
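The difference between the two shapes can be sketched on plain dictionaries. `flatten_meta` below is a hypothetical helper, not part of Haystack, and the way it resolves a clash between a metadata key and a field name is a choice of this sketch rather than a statement about the library.

```python
from typing import Any


def flatten_meta(doc_dict: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: promote the keys under "meta" to the top level."""
    flat = {key: value for key, value in doc_dict.items() if key != "meta"}
    # In this sketch a metadata key overwrites a field of the same name.
    flat.update(doc_dict.get("meta") or {})
    return flat


nested = {
    "id": "doc-1",
    "content": "Python is great",
    "meta": {"source": "wikipedia", "page_number": 1},
}
flat = flatten_meta(nested)

print(flat)
```

The nested form corresponds to flatten=False and the flat form to flatten=True.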
from_dict(data)
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "Document":
Reconstructs a Document from a dictionary:
- Deserializes blob using ByteStream.from_dict().
- Deserializes sparse_embedding using SparseEmbedding.from_dict().
- Handles both flattened and nested metadata formats. Any unknown keys (not Document fields or legacy fields) are treated as flattened metadata.
- Raises ValueError if both flattened metadata keys and a "meta" parameter are provided simultaneously.
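The metadata rules above can be sketched with the standard library. `normalize_meta` is a hypothetical helper written for illustration; the field and legacy names come from this page, while the error message and the exact control flow are assumptions.

```python
from typing import Any

# Field names taken from the dataclass definition on this page.
DOCUMENT_FIELDS = {"id", "content", "blob", "meta", "score", "embedding", "sparse_embedding"}
# Legacy Haystack 1.x names listed on this page.
LEGACY_FIELDS = {"content_type", "id_hash_keys", "dataframe"}
KNOWN_KEYS = DOCUMENT_FIELDS | LEGACY_FIELDS


def normalize_meta(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: fold unknown top-level keys into a nested "meta" dict."""
    data = dict(data)  # copy so the caller's dict is left untouched
    meta = data.pop("meta", None) or {}
    flattened = {k: v for k, v in data.items() if k not in KNOWN_KEYS}
    if meta and flattened:
        raise ValueError("Pass metadata either under 'meta' or as flattened keys, not both.")
    known = {k: v for k, v in data.items() if k in KNOWN_KEYS}
    return {**known, "meta": {**meta, **flattened}}


from_flat = normalize_meta({"content": "hi", "source": "wikipedia"})
from_nested = normalize_meta({"content": "hi", "meta": {"source": "wikipedia"}})
print(from_flat == from_nested)

try:
    normalize_meta({"content": "hi", "meta": {"a": 1}, "source": "wikipedia"})
except ValueError as err:
    print("rejected:", err)
```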
_BackwardCompatible Metaclass
class _BackwardCompatible(type):
    def __call__(cls, *args, **kwargs):
        # Validate content is str or None
        # Convert NumPy array embeddings to list[float]
        # Remove legacy fields: content_type, id_hash_keys, dataframe
        return super().__call__(*args, **kwargs)
The metaclass intercepts Document construction to:
- Validate that content is a string or None.
- Convert NumPy array embeddings to Python lists (Haystack 1.x stored embeddings as NumPy arrays).
- Remove legacy fields (content_type, id_hash_keys, dataframe) silently.
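The interception pattern can be sketched without Haystack or NumPy. Everything named below (`DropLegacyKwargs`, `Note`) is hypothetical. The check for a `tolist()` method is a duck-typed stand-in for a NumPy array check, and the exception type is an assumption.

```python
from array import array
from typing import Any, Optional

LEGACY_FIELDS = ("content_type", "id_hash_keys", "dataframe")


class DropLegacyKwargs(type):
    """Hypothetical metaclass: clean up keyword arguments before __init__ runs."""

    def __call__(cls, *args: Any, **kwargs: Any):
        content = kwargs.get("content")
        if content is not None and not isinstance(content, str):
            raise ValueError("content must be a str or None")
        embedding = kwargs.get("embedding")
        # Stand-in for the NumPy conversion: anything exposing tolist() becomes a list.
        if hasattr(embedding, "tolist"):
            kwargs["embedding"] = embedding.tolist()
        # Silently discard legacy keyword arguments.
        for name in LEGACY_FIELDS:
            kwargs.pop(name, None)
        return super().__call__(*args, **kwargs)


class Note(metaclass=DropLegacyKwargs):
    def __init__(self, content: Optional[str] = None, embedding: Optional[list] = None):
        self.content = content
        self.embedding = embedding


note = Note(content="hello", content_type="text", id_hash_keys=["content"])
vec = Note(embedding=array("d", [0.5, 0.25]))

print(note.content)   # hello
print(vec.embedding)  # [0.5, 0.25]
```

Because the cleanup happens in the metaclass `__call__`, the class's own `__init__` never sees the legacy arguments and needs no extra parameters for them.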
Equality and Representation
def __eq__(self, other):
    if type(self) != type(other):
        return False
    return self.to_dict() == other.to_dict()

def __repr__(self):
    # Truncates content to 100 characters
    # Shows blob size, meta, score, embedding size, sparse embedding size
    ...  # body omitted here
Two Documents are equal if their to_dict() representations are identical. The __repr__ method provides a human-readable summary with truncated content.
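A small stand-in class shows both behaviors. `MiniDoc` is hypothetical and has only two fields; its repr format is an assumption and only mirrors the 100-character truncation described above.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class MiniDoc:
    """Hypothetical stand-in for Document with only two fields."""

    content: Optional[str] = None
    meta: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    def __eq__(self, other: object) -> bool:
        # Different types are never equal; otherwise compare dictionaries.
        if type(self) != type(other):
            return False
        return self.to_dict() == other.to_dict()

    def __repr__(self) -> str:
        text = self.content or ""
        shown = text if len(text) <= 100 else text[:100] + "..."
        return f"MiniDoc(content: {shown!r}, meta: {self.meta})"


a = MiniDoc(content="x" * 250, meta={"page": 1})
b = MiniDoc(content="x" * 250, meta={"page": 1})

print(a == b)          # True: identical dictionaries
print(a == "x" * 250)  # False: different types
print(repr(a))
```

The `@dataclass` decorator leaves `__eq__` and `__repr__` alone when the class body already defines them, so the custom versions take effect.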
Usage Example
from haystack import Document

# Create a text document
doc = Document(
    content="Python is a popular programming language",
    meta={"source": "wikipedia", "page_number": 1},
)
print(doc.id)  # SHA-256 hash auto-generated

# Create a document with explicit ID
doc_explicit = Document(
    id="my-custom-id",
    content="Some content",
)

# Serialization round-trip
doc_dict = doc.to_dict(flatten=False)
restored = Document.from_dict(doc_dict)
assert doc == restored

# Document with embedding
doc_embedded = Document(
    content="Python is great",
    embedding=[0.1, 0.2, 0.3, 0.4],
)