Implementation:Huggingface Datatrove Document
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Software Architecture |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Defines the core data structures (Document, Media, and MediaType) that carry data through every stage of the datatrove processing pipeline.
Description
The Document dataclass is the universal unit of data in datatrove. Every pipeline step (readers, filters, extractors, writers, deduplicators) receives and produces Document objects through a generator-based pipeline. A document carries four fields: text (the actual textual content), id (a unique string identifier), media (a list of associated Media objects), and metadata (an open dictionary for arbitrary annotations).
The Media dataclass represents multimedia content associated with a document. It contains an id, type (from MediaType integer constants), url, and optional fields for alt text, path, byte offset/length, raw media_bytes, and a metadata dictionary. The MediaType class defines integer constants for IMAGE (0), VIDEO (1), AUDIO (2), and DOCUMENT (3).
Both Document and Media use slots=True for memory efficiency, which matters when processing millions of documents. The module also defines DocumentsPipeline, a NewType wrapping Generator[Document, None, None] | None, which formalizes the type of the data stream that flows between pipeline steps.
Usage
Use the Document class whenever creating, modifying, or consuming data within a datatrove pipeline. Readers create Documents, filters accept and yield Documents, and writers consume Documents. The metadata dictionary is the standard mechanism for pipeline steps to communicate annotations (e.g., quality scores, language tags, contamination flags) to downstream steps.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/data.py
- Lines: 1-58
Signature
from dataclasses import dataclass, field
from typing import Any, Generator, NewType


class MediaType:
    IMAGE = 0
    VIDEO = 1
    AUDIO = 2
    DOCUMENT = 3


@dataclass(slots=True)
class Media:
    id: str
    type: int
    url: str
    alt: str | None = None
    path: str | None = None
    offset: int | None = None
    length: int | None = None
    media_bytes: bytes | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(slots=True)
class Document:
    text: str
    id: str
    media: list[Media] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


DocumentsPipeline = NewType("DocumentsPipeline", Generator[Document, None, None] | None)
Import
from datatrove.data import Document, Media, MediaType, DocumentsPipeline
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | The textual content of the document |
| id | str | Yes | A unique identifier for this document |
| media | list[Media] | No | Associated media objects (defaults to empty list) |
| metadata | dict[str, Any] | No | Arbitrary key-value annotations (defaults to empty dict) |
Outputs
| Name | Type | Description |
|---|---|---|
| Document | dataclass instance | A document carrying text, id, media, and metadata through the pipeline |
| DocumentsPipeline | Generator[Document, None, None] \| None | The type of the data stream passed between pipeline steps |
Usage Examples
Basic Usage
from datatrove.data import Document, Media, MediaType

# Create a simple document
doc = Document(
    text="This is the document content.",
    id="doc_001",
    metadata={"source": "web", "language": "en"}
)

# Access fields
print(doc.text)      # "This is the document content."
print(doc.id)        # "doc_001"
print(doc.metadata)  # {"source": "web", "language": "en"}

# Create a document with media
media = Media(id="img_001", type=MediaType.IMAGE, url="https://example.com/image.png")
doc_with_media = Document(
    text="Document with an image.",
    id="doc_002",
    media=[media]
)