Implementation:Huggingface Datatrove Document

From Leeroopedia
Revision as of 13:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datatrove_Document.md)
Knowledge Sources
Domains: Data Processing, Software Architecture
Last Updated: 2026-02-14 17:00 GMT

Overview

Defines the core data structures (Document, Media, and MediaType) used by every stage of the datatrove processing pipeline.

Description

The Document dataclass is the universal unit of data in datatrove. Pipeline steps (readers, filters, extractors, writers, deduplicators) exchange Document objects as generators: each step consumes the stream produced by the previous step and yields documents to the next. A document carries four fields: text (the textual content), id (a unique string identifier), media (a list of associated Media objects), and metadata (an open dictionary for arbitrary annotations).

The Media dataclass represents multimedia content associated with a document. It contains an id, type (from MediaType integer constants), url, and optional fields for alt text, path, byte offset/length, raw media_bytes, and a metadata dictionary. The MediaType class defines integer constants for IMAGE (0), VIDEO (1), AUDIO (2), and DOCUMENT (3).

Both Document and Media are declared with slots=True (a dataclass option available from Python 3.10). Slotted instances have no per-instance __dict__, which reduces memory use when processing millions of documents. The module also defines DocumentsPipeline with NewType over Generator[Document, None, None] | None, giving a name to the type of the data stream that flows between pipeline steps.

Usage

Use the Document class whenever creating, modifying, or consuming data within a datatrove pipeline. Readers create Documents, filters accept and yield Documents, and writers consume Documents. The metadata dictionary is the standard mechanism for pipeline steps to communicate annotations (e.g., quality scores, language tags, contamination flags) to downstream steps.

Code Reference

Source Location

Signature

class MediaType:
    IMAGE = 0
    VIDEO = 1
    AUDIO = 2
    DOCUMENT = 3

@dataclass(slots=True)
class Media:
    id: str
    type: int
    url: str
    alt: str | None = None
    path: str | None = None
    offset: int | None = None
    length: int | None = None
    media_bytes: bytes | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

@dataclass(slots=True)
class Document:
    text: str
    id: str
    media: list[Media] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

DocumentsPipeline = NewType("DocumentsPipeline", Generator[Document, None, None] | None)

Import

from datatrove.data import Document, Media, MediaType, DocumentsPipeline

I/O Contract

Inputs

Name | Type | Required | Description
text | str | Yes | The textual content of the document
id | str | Yes | A unique identifier for this document
media | list[Media] | No | Associated media objects (defaults to an empty list)
metadata | dict[str, Any] | No | Arbitrary key-value annotations (defaults to an empty dict)

Outputs

Name | Type | Description
Document | dataclass instance | A document carrying text, id, media, and metadata through the pipeline
DocumentsPipeline | Generator[Document, None, None] or None | The named type (created with NewType) for the data stream between pipeline steps

Usage Examples

Basic Usage

from datatrove.data import Document, Media, MediaType

# Create a simple document
doc = Document(
    text="This is the document content.",
    id="doc_001",
    metadata={"source": "web", "language": "en"}
)

# Access fields
print(doc.text)       # This is the document content.
print(doc.id)         # doc_001
print(doc.metadata)   # {'source': 'web', 'language': 'en'}

# Create a document with media
media = Media(id="img_001", type=MediaType.IMAGE, url="https://example.com/image.png")
doc_with_media = Document(
    text="Document with an image.",
    id="doc_002",
    media=[media]
)
