Principle:Huggingface Datatrove Data Reading Framework

Knowledge Sources	Huggingface_Datatrove
Domains	Data Ingestion, Pipeline Architecture, Software Design
Last Updated	2026-02-14 17:00 GMT

Overview

The Data Reading Framework principle defines a layered, extensible architecture for ingesting data from heterogeneous sources, transforming raw records into standardized Document objects, and distributing work across parallel workers.

Description

Data ingestion is the entry point of any processing pipeline, and its design determines how easily the system can be extended to support new formats and data sources. The framework establishes a two-tier hierarchy: a base reader that handles document creation, data adaptation, and lifecycle management; and a disk reader that adds filesystem abstraction, sharding, and file-level iteration.

The adapter pattern is central to this design. Rather than requiring each reader implementation to know the Document schema, the framework provides a pluggable adapter function that transforms raw data dictionaries into the canonical Document format. The default adapter handles common cases (extracting text, id, and metadata fields), while custom adapters can reshape arbitrarily structured data. This separation allows the same reader to process different data schemas without code changes.
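The adapter idea can be sketched in a few lines. This is a minimal illustration, not Datatrove's actual code: the Document dataclass, the adapter signature (data, path, id_in_file), the fallback id format, and the forum_adapter schema are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Document:
    """Hypothetical stand-in for the canonical Document type."""
    text: str
    id: str
    metadata: dict[str, Any] = field(default_factory=dict)


# An adapter maps one raw record to the keyword arguments of Document.
# The three-argument signature is an assumption for this sketch.
Adapter = Callable[[dict, str, int], dict]


def default_adapter(data: dict, path: str, id_in_file: int) -> dict:
    """Pull text, id, and metadata from conventionally named keys."""
    return {
        "text": data.get("text", ""),
        # Fall back to a path-based id when the record carries none.
        "id": data.get("id", f"{path}/{id_in_file}"),
        "metadata": dict(data.get("metadata", {})),
    }


def forum_adapter(data: dict, path: str, id_in_file: int) -> dict:
    """Custom adapter for a hypothetical schema using 'body' and 'post_id'."""
    return {
        "text": data["body"],
        "id": str(data["post_id"]),
        "metadata": {"author": data.get("author")},
    }


def to_document(data: dict, path: str, id_in_file: int,
                adapter: Adapter = default_adapter) -> Document:
    """Build a Document without the caller knowing the raw schema."""
    return Document(**adapter(data, path, id_in_file))
```

Swapping default_adapter for forum_adapter changes how records are interpreted while to_document, and any reader built on it, stays untouched.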

The sharding model lets workers run in parallel without duplicating work. The framework divides the available files across workers based on rank and world_size, so each worker processes a distinct subset. This file-level sharding is coarse-grained: a worker cannot be assigned less than one whole file, so useful parallelism is capped by the number of files. It is effective for large multi-file datasets because it avoids the overhead of record-level coordination between workers.
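A minimal sketch of file-level sharding follows. The free function get_shard here is a simplified stand-in, and the strided assignment (every world_size-th file, starting at rank) is an assumption about one reasonable strategy; the library's own partitioning may differ.

```python
def get_shard(files: list[str], rank: int, world_size: int) -> list[str]:
    """Return the files assigned to worker `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so every worker derives the same ordering independently,
    # then take every world_size-th file starting at this worker's rank.
    return sorted(files)[rank::world_size]


# Seven hypothetical input files split across three workers.
files = [f"part-{i:03d}.jsonl" for i in range(7)]
shards = [get_shard(files, rank, 3) for rank in range(3)]
```

Because each worker computes its shard from the same sorted listing, the assignment is deterministic and needs no communication between workers.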

Usage

Apply this framework when designing data ingestion components that need to support multiple file formats, storage backends, or data schemas. Use the two-tier hierarchy (base reader + disk reader) to separate format-specific parsing from filesystem access and sharding concerns.
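One way to picture the two tiers is the sketch below. The class names ReaderSketch, DiskReaderSketch, and JsonlReaderSketch are hypothetical and deliberately distinct from the library's classes; only the read_file and run method names come from the description on this page. Documents are plain dicts here to keep the example short.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator, Optional


class ReaderSketch(ABC):
    """Tier 1 (hypothetical): turns raw records into documents."""

    def get_document(self, data: dict, source: str, index: int) -> Optional[dict]:
        text = data.get("text")
        if not text:
            return None  # records without text are dropped
        return {"text": text, "id": data.get("id", f"{source}/{index}")}

    @abstractmethod
    def run(self, rank: int = 0, world_size: int = 1) -> Iterator[dict]:
        """Yield this worker's documents to the pipeline."""


class DiskReaderSketch(ReaderSketch):
    """Tier 2 (hypothetical): file listing, sharding, and file iteration."""

    def __init__(self, folder: str, pattern: str = "*"):
        self.folder = Path(folder)
        self.pattern = pattern

    @abstractmethod
    def read_file(self, path: Path) -> Iterator[dict]:
        """Format-specific hook: yield documents from one file."""

    def run(self, rank: int = 0, world_size: int = 1) -> Iterator[dict]:
        # Strided sharding is an assumption made for this sketch.
        for path in sorted(self.folder.glob(self.pattern))[rank::world_size]:
            yield from self.read_file(path)


class JsonlReaderSketch(DiskReaderSketch):
    """A format-specific reader only has to implement read_file."""

    def read_file(self, path: Path) -> Iterator[dict]:
        with path.open(encoding="utf-8") as fh:
            for index, line in enumerate(fh):
                doc = self.get_document(json.loads(line), path.name, index)
                if doc is not None:
                    yield doc


def demo() -> list[dict]:
    """Write two small JSONL files and read them back with one worker."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.jsonl").write_text('{"text": "one"}\n{"text": ""}\n', encoding="utf-8")
        Path(tmp, "b.jsonl").write_text('{"text": "two", "id": "x"}\n', encoding="utf-8")
        return list(JsonlReaderSketch(tmp, "*.jsonl").run())
```

Supporting another format under this layout means adding one subclass with its own read_file; sharding and document creation are inherited unchanged.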

Theoretical Basis

The key concepts underlying the data reading framework are:

Adapter Pattern: The adapter function decouples the raw data format from the internal Document representation. This allows readers to be format-agnostic and supports schema evolution without modifying reader implementations.

Template Method Pattern: The base classes define the skeleton of the reading algorithm (shard computation, file iteration, document creation, statistics) and defer format-specific parsing to the abstract read_file method, which each subclass implements.

File-Level Sharding: Work distribution is based on dividing the list of input files across workers. Each worker computes its own shard using data_folder.get_shard(rank, world_size), ensuring deterministic, non-overlapping assignment without inter-worker communication.

Generator-Based Streaming: The framework uses Python generators throughout (read_file yields documents, read_files_shard yields across files, run yields to the pipeline). Because documents are produced one at a time and never collected into a list, memory use is bounded by the documents currently in flight rather than by the size of the dataset.

Progressive Filtering: The skip and limit parameters enable efficient debugging and sampling by short-circuiting the reading process. Skip is applied first, followed by the limit, allowing precise control over which documents enter the pipeline.

Metadata Enrichment: Default metadata and file path metadata are automatically merged into every document, ensuring provenance information is always available for downstream processing and debugging.
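The streaming, skip/limit, and metadata behaviors listed above can be combined in one small sketch. Everything here is illustrative: read_shard is a hypothetical function, the in-memory records_by_file mapping stands in for real files, limit=-1 meaning "no limit" is an assumed convention, and the metadata merge order is an assumption as well.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def read_shard(
    records_by_file: dict[str, Iterable[dict]],
    default_metadata: Optional[dict] = None,
    skip: int = 0,
    limit: int = -1,
) -> Iterator[dict]:
    """Lazily yield documents, applying skip first and limit second."""

    def all_documents() -> Iterator[dict]:
        for path, records in records_by_file.items():
            for record in records:
                # Assumed merge order: defaults, then the file path, then the
                # record's own metadata, so more specific sources win.
                metadata = {
                    **(default_metadata or {}),
                    "file_path": path,
                    **record.get("metadata", {}),
                }
                yield {"text": record["text"], "metadata": metadata}

    # islice drops the first `skip` items and stops after `limit` more,
    # so records past the limit are never pulled from the generator.
    stop = None if limit < 0 else skip + limit
    yield from islice(all_documents(), skip, stop)


# Hypothetical input: three records in one file, one in another.
data = {
    "a.jsonl": [{"text": "t0"}, {"text": "t1"}, {"text": "t2"}],
    "b.jsonl": [{"text": "t3", "metadata": {"lang": "en"}}],
}
sample = list(read_shard(data, {"source": "demo"}, skip=1, limit=2))
```

In this sketch skipped records are still produced and then discarded, so skipping saves downstream work but not parsing; stopping at the limit does avoid touching any later records or files.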

Related Pages

Implementation:Huggingface_Datatrove_BaseReader
