Implementation:Huggingface Datatrove Document

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Software Architecture
Last Updated	2026-02-14 17:00 GMT

Overview

Defines the core data structures -- Document, Media, and MediaType -- that flow through every stage of the datatrove processing pipeline.

Description

The Document dataclass is the universal unit of data in datatrove. Pipeline steps (readers, filters, extractors, writers, deduplicators) pass Document objects to one another through a generator-based pipeline: readers produce them, intermediate steps receive and yield them, and writers consume them. A document carries four fields: text (the textual content), id (a unique string identifier), media (a list of associated Media objects), and metadata (an open dictionary for arbitrary annotations).

The Media dataclass represents multimedia content associated with a document. It contains an id, type (from MediaType integer constants), url, and optional fields for alt text, path, byte offset/length, raw media_bytes, and a metadata dictionary. The MediaType class defines integer constants for IMAGE (0), VIDEO (1), AUDIO (2), and DOCUMENT (3).

Both Document and Media are declared with slots=True (a dataclass option available from Python 3.10), which removes the per-instance __dict__ and reduces memory use, an important saving when processing millions of documents. The module also defines DocumentsPipeline, a NewType over Generator[Document, None, None] | None, which names the type of the data stream that flows between pipeline steps.

Usage

Use the Document class whenever creating, modifying, or consuming data within a datatrove pipeline. Readers create Documents, filters accept and yield Documents, and writers consume Documents. The metadata dictionary is the standard mechanism for pipeline steps to communicate annotations (e.g., quality scores, language tags, contamination flags) to downstream steps.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/data.py
Lines: 1-58

Signature

class MediaType:
    IMAGE = 0
    VIDEO = 1
    AUDIO = 2
    DOCUMENT = 3

@dataclass(slots=True)
class Media:
    id: str
    type: int
    url: str
    alt: str | None = None
    path: str | None = None
    offset: int | None = None
    length: int | None = None
    media_bytes: bytes | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass(slots=True)
class Document:
    text: str
    id: str
    media: list[Media] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

DocumentsPipeline = NewType("DocumentsPipeline", Generator[Document, None, None] | None)

Import

from datatrove.data import Document, Media, MediaType, DocumentsPipeline

I/O Contract

Inputs

Name	Type	Required	Description
text	str	Yes	The textual content of the document
id	str	Yes	A unique identifier for this document
media	list[Media]	No	Associated media objects (defaults to empty list)
metadata	dict[str, Any]	No	Arbitrary key-value annotations (defaults to empty dict)

Outputs

Name	Type	Description
Document	dataclass instance	A document carrying text, id, media, and metadata through the pipeline
DocumentsPipeline	Generator[Document, None, None] | None	The type of the data stream passed between pipeline steps

Usage Examples

Basic Usage

from datatrove.data import Document, Media, MediaType

# Create a simple document
doc = Document(
    text="This is the document content.",
    id="doc_001",
    metadata={"source": "web", "language": "en"}
)

# Access fields (comments show what print writes to stdout)
print(doc.text)       # This is the document content.
print(doc.id)         # doc_001
print(doc.metadata)   # {'source': 'web', 'language': 'en'}

# Create a document with media
media = Media(id="img_001", type=MediaType.IMAGE, url="https://example.com/image.png")
doc_with_media = Document(
    text="Document with an image.",
    id="doc_002",
    media=[media]
)

Related Pages

Principle:Huggingface_Datatrove_Core_Data_Structures
