Implementation:Deepset ai Haystack Document Dataclass

Overview

Document is the core dataclass in Haystack. It represents a unit of data flowing through pipelines and can hold text, binary data, metadata, embeddings, and scores. A _BackwardCompatible metaclass handles migration from Haystack 1.x. This page describes the structure and interface of that dataclass.

Source Location

File: haystack/dataclasses/document.py, lines 46-170+
Dataclass: Document (with _BackwardCompatible metaclass)
Decorator: @dataclass
Metaclass: _BackwardCompatible

Import

from haystack import Document

or equivalently:

from haystack.dataclasses import Document

Fields

Field	Type	Default	Description
id	`str`	`""` (auto-generated SHA-256)	Unique identifier; auto-generated from content hash if not set
content	`str | None`	`None`	Text content of the document
blob	`ByteStream | None`	`None`	Binary data (images, audio, etc.)
meta	`dict[str, Any]`	`{}`	Arbitrary JSON-serializable metadata
score	`float | None`	`None`	Relevance score assigned by retrievers or rankers
embedding	`list[float] | None`	`None`	Dense vector representation of the document
sparse_embedding	`SparseEmbedding | None`	`None`	Sparse vector representation (e.g., BM25, SPLADE)

Full Dataclass Definition

@dataclass
class Document(metaclass=_BackwardCompatible):
    id: str = field(default="")
    content: str | None = field(default=None)
    blob: ByteStream | None = field(default=None)
    meta: dict[str, Any] = field(default_factory=dict)
    score: float | None = field(default=None)
    embedding: list[float] | None = field(default=None)
    sparse_embedding: SparseEmbedding | None = field(default=None)

ID Generation

The __post_init__ method automatically generates an ID if none is provided:

import hashlib

def __post_init__(self):
    # Keep an explicit ID; otherwise derive one from the document's fields.
    self.id = self.id or self._create_id()

def _create_id(self) -> str:
    # Normalize each field so that missing values hash consistently.
    text = self.content or None
    blob = self.blob.data if self.blob is not None else None
    mime_type = self.blob.mime_type if self.blob is not None else None
    meta = self.meta or {}
    embedding = self.embedding if self.embedding is not None else None
    sparse_embedding = self.sparse_embedding.to_dict() if self.sparse_embedding is not None else ""
    # The literal {None} formats as the string "None" and fills a fixed slot in the hashed string.
    data = f"{text}{None}{blob!r}{mime_type}{meta}{embedding}{sparse_embedding}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

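The ID is the SHA-256 hex digest of a single string built from the content, blob data and MIME type, meta, embedding, and sparse embedding. Identical field values therefore always produce the same ID, which makes deduplication by ID possible.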
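
The deterministic behavior can be checked without Haystack installed. The following is a minimal standalone sketch, not the library's code: sketch_document_id is a hypothetical helper that mirrors the f-string layout of _create_id for text, meta, and embedding only, and leaves the blob and sparse-embedding slots at their empty defaults.

```python
import hashlib
from typing import Any, Optional


def sketch_document_id(
    content: Optional[str] = None,
    meta: Optional[dict[str, Any]] = None,
    embedding: Optional[list[float]] = None,
) -> str:
    """Hypothetical helper: hash the same field layout as _create_id."""
    text = content or None
    blob = None            # no blob in this sketch
    mime_type = None       # no blob, so no MIME type either
    meta = meta or {}
    sparse_embedding = ""  # empty-string default, as in _create_id
    data = f"{text}{None}{blob!r}{mime_type}{meta}{embedding}{sparse_embedding}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


# Same inputs give the same ID; changing the metadata changes the ID.
id_a = sketch_document_id("Python is a popular programming language", {"source": "wikipedia"})
id_b = sketch_document_id("Python is a popular programming language", {"source": "wikipedia"})
id_c = sketch_document_id("Python is a popular programming language", {"source": "blog"})
```

Because meta is part of the hashed string, two documents with the same text but different metadata receive different IDs. In this sketch the insertion order of the meta keys also affects the result, since the dict is formatted through its string representation.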

Methods

to_dict(flatten=True)

def to_dict(self, flatten: bool = True) -> dict[str, Any]:

Converts the Document to a dictionary:

Converts blob and sparse_embedding fields using their respective to_dict() methods.
With flatten=True (default): metadata keys are promoted to top-level dictionary keys for backward compatibility with Haystack 1.x.
With flatten=False: metadata remains nested under the "meta" key.
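
The difference between the two modes can be illustrated on a plain dictionary. This is a sketch under assumptions rather than the library's implementation: sketch_to_dict is a hypothetical helper that works on an already-built dict of fields, and its choice to let top-level fields win on a key clash is an assumption of the sketch.

```python
from typing import Any


def sketch_to_dict(fields: dict[str, Any], flatten: bool = True) -> dict[str, Any]:
    """Hypothetical helper: optionally promote the nested "meta" dict to top-level keys."""
    data = dict(fields)  # shallow copy so the caller's dict is left untouched
    if flatten:
        meta = data.pop("meta", {})
        # Assumption of this sketch: top-level fields win if a meta key clashes.
        return {**meta, **data}
    return data


fields = {
    "id": "doc-1",
    "content": "Some content",
    "meta": {"source": "wikipedia", "page_number": 1},
}
flat = sketch_to_dict(fields, flatten=True)     # "source" and "page_number" at top level
nested = sketch_to_dict(fields, flatten=False)  # metadata stays under "meta"
```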

from_dict(data)

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "Document":

Reconstructs a Document from a dictionary:

Deserializes blob using ByteStream.from_dict().
Deserializes sparse_embedding using SparseEmbedding.from_dict().
Handles both flattened and nested metadata formats. Any unknown keys (not Document fields or legacy fields) are treated as flattened metadata.
Raises ValueError if both flattened metadata keys and a "meta" parameter are provided simultaneously.
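
The metadata handling described above can be sketched as follows. This is an illustration under assumptions, not the library's code: sketch_split_meta is a hypothetical helper, the field names come from the Fields table, and the legacy names are the ones this page lists for the _BackwardCompatible metaclass. Blob and sparse-embedding deserialization are left out.

```python
from typing import Any

# Field names from the Fields table; legacy names as listed for the metaclass.
DOCUMENT_FIELDS = {"id", "content", "blob", "meta", "score", "embedding", "sparse_embedding"}
LEGACY_FIELDS = {"content_type", "id_hash_keys", "dataframe"}


def sketch_split_meta(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: fold unknown top-level keys back into "meta"."""
    data = dict(data)  # shallow copy so the caller's dict is left untouched
    meta = data.pop("meta", None) or {}
    # Anything that is neither a Document field nor a legacy field is flattened metadata.
    flattened = {k: v for k, v in data.items() if k not in DOCUMENT_FIELDS | LEGACY_FIELDS}
    if meta and flattened:
        raise ValueError("Pass metadata either nested under 'meta' or flattened, not both.")
    known = {k: v for k, v in data.items() if k not in flattened}
    return {**known, "meta": {**meta, **flattened}}


nested_in = {"id": "doc-1", "content": "x", "meta": {"source": "wikipedia"}}
flat_in = {"id": "doc-1", "content": "x", "source": "wikipedia"}
# Both input shapes normalize to the same nested form.
same = sketch_split_meta(nested_in) == sketch_split_meta(flat_in)
```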

_BackwardCompatible Metaclass

class _BackwardCompatible(type):
    def __call__(cls, *args, **kwargs):
        # Validate content is str or None
        # Convert NumPy array embeddings to list[float]
        # Remove legacy fields: content_type, id_hash_keys, dataframe
        return super().__call__(*args, **kwargs)

The metaclass intercepts Document construction to:

Validate that content is a string or None.
Convert NumPy array embeddings to Python lists (Haystack 1.x stored embeddings as NumPy arrays).
Remove legacy fields (content_type, id_hash_keys, dataframe) silently.
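
The three steps can be made concrete with a toy dataclass. The following is a minimal sketch under assumptions, not the Haystack source: SketchBackwardCompatible and SketchDocument are hypothetical names, and the sketch detects array-like embeddings by the presence of a tolist() method so that it runs without NumPy (a standard-library array stands in for a NumPy array).

```python
from array import array
from dataclasses import dataclass, field
from typing import Any, Optional

LEGACY_FIELDS = ("content_type", "id_hash_keys", "dataframe")


class SketchBackwardCompatible(type):
    """Hypothetical metaclass: clean up keyword arguments before __init__ runs."""

    def __call__(cls, *args: Any, **kwargs: Any) -> Any:
        content = kwargs.get("content")
        if content is not None and not isinstance(content, str):
            raise ValueError("content must be a string or None")
        # Assumption: anything exposing .tolist() (such as a NumPy array) is array-like.
        embedding = kwargs.get("embedding")
        if hasattr(embedding, "tolist"):
            kwargs["embedding"] = embedding.tolist()
        # Drop legacy keyword arguments silently.
        for name in LEGACY_FIELDS:
            kwargs.pop(name, None)
        return super().__call__(*args, **kwargs)


@dataclass
class SketchDocument(metaclass=SketchBackwardCompatible):
    content: Optional[str] = None
    embedding: Optional[list[float]] = None
    meta: dict[str, Any] = field(default_factory=dict)


# Legacy keyword arguments are accepted and discarded.
doc = SketchDocument(content="Some content", content_type="text", id_hash_keys=["content"])
# An array-like embedding is converted to a plain list.
vec = SketchDocument(embedding=array("d", [0.5, 0.25]))
```

Intercepting __call__ on the metaclass lets the cleanup happen before the dataclass-generated __init__ sees the arguments, so the dataclass itself does not need to declare the legacy fields.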

Equality and Representation

def __eq__(self, other):
    if type(self) != type(other):
        return False
    return self.to_dict() == other.to_dict()

def __repr__(self):
    # Truncates content to 100 characters
    # Shows blob size, meta, score, embedding size, sparse embedding size
    ...

Two Documents are equal if their to_dict() representations are identical. The __repr__ method provides a human-readable summary with truncated content.
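
A toy class shows both behaviors in isolation. This is a sketch under assumptions: SketchDocument is a hypothetical stand-in with only two fields, and the exact repr layout and the "..." truncation marker are choices of the sketch, not the library's output format.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass(eq=False, repr=False)  # the sketch supplies its own __eq__ and __repr__
class SketchDocument:
    content: Optional[str] = None
    meta: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    def __eq__(self, other: object) -> bool:
        # Different types are never equal; otherwise compare dictionary forms.
        if type(self) != type(other):
            return False
        return self.to_dict() == other.to_dict()

    def __repr__(self) -> str:
        text = self.content or ""
        # Assumption: mark truncation with "..."; the real format may differ.
        shown = text if len(text) <= 100 else text[:100] + "..."
        return f"SketchDocument(content: {shown!r}, meta: {self.meta})"


long_a = SketchDocument("x" * 150, {"k": 1})
long_b = SketchDocument("x" * 150, {"k": 1})
summary = repr(long_a)  # shows only the first 100 characters of content
```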

Usage Example

from haystack import Document

# Create a text document
doc = Document(
    content="Python is a popular programming language",
    meta={"source": "wikipedia", "page_number": 1},
)
print(doc.id)  # SHA-256 hash auto-generated

# Create a document with explicit ID
doc_explicit = Document(
    id="my-custom-id",
    content="Some content",
)

# Serialization round-trip
doc_dict = doc.to_dict(flatten=False)
restored = Document.from_dict(doc_dict)
assert doc == restored

# Document with embedding
doc_embedded = Document(
    content="Python is great",
    embedding=[0.1, 0.2, 0.3, 0.4],
)
