Principle:Huggingface Datatrove Disk Writing Framework

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Software Architecture
Last Updated	2026-02-14 17:00 GMT

Overview

The disk writing framework is an architectural pattern that provides a reusable abstract base for writing pipeline output to storage, encapsulating common concerns such as filename templating, compression, format adaptation, and file size management.

Description

Data processing pipelines need to write output in many different formats (JSONL, Parquet, CSV, etc.) to various storage backends (local disk, S3, HDFS). The disk writing framework addresses this by providing a single abstract base class that handles all format-agnostic concerns, so that concrete implementations contain only serialization logic. This follows the Template Method design pattern, in which the base class defines the algorithm skeleton and subclasses fill in the format-specific steps.

The framework separates what to write (determined by an adapter function that transforms documents into dictionaries) from how to write (determined by the concrete subclass's serialization method) and where to write (determined by the filename template and output folder configuration). Because these three concerns vary independently, a new output format, a new record shape, or a new file layout can each be introduced without changing the other two.

Usage

Apply this framework pattern when building data processing pipelines that need to output data in multiple formats. Extend the base writer class to add new output formats; use the existing concrete implementations for standard formats.

Theoretical Basis

The disk writing framework is built on several software engineering principles:

Template Method Pattern: The base class defines the overall write workflow (compute filename, adapt document, write data, update statistics) while deferring the actual serialization to an abstract `_write` method that subclasses implement. This ensures consistent behavior across all output formats.
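The workflow described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the library's actual class: the names `BaseWriterSketch` and `JsonlWriterSketch`, the in-memory `outputs` dictionary standing in for files, and the fixed filename scheme are all assumptions made for the example. Only the idea of an abstract `_write` hook comes from the text.

```python
from abc import ABC, abstractmethod
from collections import Counter
import json


class BaseWriterSketch(ABC):
    """Hypothetical base writer: owns the workflow, not the format."""

    def __init__(self):
        # filename -> list of serialized chunks (stands in for real files)
        self.outputs = {}
        self.stats = Counter()

    def write(self, document: dict, rank: int = 0) -> None:
        filename = f"{rank:05d}.out"                  # 1. compute filename
        record = dict(document)                       # 2. adapt document
        sink = self.outputs.setdefault(filename, [])
        self._write(record, sink)                     # 3. format-specific step
        self.stats[filename] += 1                     # 4. update statistics

    @abstractmethod
    def _write(self, record: dict, sink: list) -> None:
        """Serialize one record; the only method a subclass must supply."""


class JsonlWriterSketch(BaseWriterSketch):
    def _write(self, record: dict, sink: list) -> None:
        # One JSON object per line
        sink.append(json.dumps(record) + "\n")


writer = JsonlWriterSketch()
writer.write({"id": "a", "text": "hello"}, rank=3)
```

Adding another format in this sketch means writing one more subclass with its own `_write`; the filename, adaptation, and statistics steps are inherited unchanged.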

Filename templating: Output filenames support variable substitution using Python's `string.Template` syntax. The `${rank}` variable prevents parallel workers from writing to the same file, while metadata-based variables like `${tag}` enable dynamic output routing based on document properties. This is essential for distributed processing where multiple workers produce output simultaneously.
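As a small illustration of the substitution step, the sketch below resolves a template with the standard library's `string.Template`. The helper `resolve_filename`, the template string, and the five-digit zero-padding of the rank are assumptions chosen for the example rather than details confirmed by this page.

```python
from string import Template

# Hypothetical template; ${rank} and ${tag} are the variables named in the text.
FILENAME_TEMPLATE = Template("${tag}/${rank}.jsonl")


def resolve_filename(template: Template, rank: int, metadata: dict) -> str:
    """Fill the template from document metadata plus the worker rank."""
    # Zero-padding the rank to five digits is an assumption made for this sketch.
    return template.substitute(metadata, rank=f"{rank:05d}")


english = resolve_filename(FILENAME_TEMPLATE, 0, {"tag": "en"})
french = resolve_filename(FILENAME_TEMPLATE, 1, {"tag": "fr"})
```

Note that `Template.substitute` raises `KeyError` when a variable in the template has no value, so a document lacking the `tag` key would fail loudly here instead of being written to a malformed path.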

Automatic file splitting: When writing large datasets, individual output files may exceed practical size limits for downstream tools or storage systems. The framework supports automatic file splitting at a configurable byte threshold, prepending incrementing counters (e.g., `000_`, `001_`) to filenames. This is restricted to binary write mode to ensure accurate byte-level size tracking.
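One way to picture the rollover logic is the sketch below. It is not the framework's implementation: the class `SplittingSink`, its attribute names, and the choice to roll over before the write that follows a full file are assumptions, and an in-memory dictionary of byte buffers stands in for files on disk. The three-digit counter prefix mirrors the `000_`, `001_` example in the text.

```python
import os
from collections import defaultdict


class SplittingSink:
    """Hypothetical helper that switches to a new file once max_bytes is reached."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.file_index = defaultdict(int)      # base name -> current counter
        self.bytes_written = defaultdict(int)   # base name -> bytes in current file
        self.files = defaultdict(bytearray)     # actual name -> content

    def _name_with_counter(self, base: str) -> str:
        # Prepend the counter to the file name, keeping any directory part
        directory, name = os.path.split(base)
        numbered = f"{self.file_index[base]:03d}_{name}"
        return f"{directory}/{numbered}" if directory else numbered

    def write(self, base: str, payload: bytes) -> str:
        # Roll over before writing if the current file already hit the limit
        if self.bytes_written[base] >= self.max_bytes:
            self.file_index[base] += 1
            self.bytes_written[base] = 0
        target = self._name_with_counter(base)
        self.files[target] += payload
        # len() of a bytes payload is exact, which is why binary mode matters
        self.bytes_written[base] += len(payload)
        return target


sink = SplittingSink(max_bytes=10)
names = [sink.write("out/data.jsonl", b"12345678") for _ in range(3)]
```

In this sketch the first two 8-byte writes land in `out/000_data.jsonl` (the limit is only checked before each write), and the third goes to `out/001_data.jsonl`.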

Adapter pattern: The pluggable adapter function decouples the internal `Document` representation from the output format. The default adapter performs a straightforward dataclass-to-dictionary conversion with optional metadata expansion, but users can supply custom adapters to reshape data for specific downstream consumers.
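The difference between a default and a custom adapter can be shown with a stand-in document type. Everything named here is hypothetical: the `Doc` dataclass, the `expand_metadata` flag, and both adapter functions are illustrations of the idea, not the library's signatures.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Doc:
    """Hypothetical stand-in for the pipeline's document type."""
    text: str
    id: str
    metadata: dict = field(default_factory=dict)


def default_adapter(doc: Doc, expand_metadata: bool = False) -> dict:
    """Dataclass-to-dictionary conversion with optional metadata expansion."""
    data = asdict(doc)
    if expand_metadata:
        # Lift metadata keys to the top level of the record
        metadata = data.pop("metadata")
        data.update(metadata)
    return data


def text_only_adapter(doc: Doc) -> dict:
    """Custom adapter: reshape the record for a consumer that wants two fields."""
    return {"content": doc.text, "lang": doc.metadata.get("lang", "unknown")}


doc = Doc(text="hi", id="1", metadata={"lang": "en"})
nested = default_adapter(doc)
flat = default_adapter(doc, expand_metadata=True)
custom = text_only_adapter(doc)
```

Because the writer only ever sees the dictionary an adapter returns, swapping `default_adapter` for `text_only_adapter` changes the output schema without touching any serialization code.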

Context manager protocol: The framework implements Python's context manager protocol (`__enter__`/`__exit__`) to ensure that output files are properly flushed and closed even in the presence of exceptions, preventing data loss or corruption.
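The guarantee described above can be demonstrated with a small stand-in class. `ManagedWriter` and its methods are hypothetical names for this sketch, which uses only the standard library and a temporary directory; it shows the protocol itself, not the framework's code.

```python
import os
import tempfile


class ManagedWriter:
    """Hypothetical writer that closes every open file on exit."""

    def __init__(self, folder: str):
        self.folder = folder
        self._handles = {}

    def __enter__(self):
        return self

    def write(self, filename: str, payload: bytes) -> None:
        # Open each output file lazily, in binary mode, on first use
        if filename not in self._handles:
            path = os.path.join(self.folder, filename)
            self._handles[filename] = open(path, "wb")
        self._handles[filename].write(payload)

    def close(self) -> None:
        # Closing a file flushes its buffered data to disk
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # never swallow the caller's exception


with tempfile.TemporaryDirectory() as folder:
    try:
        with ManagedWriter(folder) as writer:
            writer.write("00000.bin", b"partial")
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass
    # The data written before the failure is on disk and the file is closed
    with open(os.path.join(folder, "00000.bin"), "rb") as fh:
        recovered = fh.read()
```

Returning `False` from `__exit__` lets the original exception propagate after cleanup, so the caller still learns about the failure while the already-written bytes are safely flushed.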

Related Pages

Implementation:Huggingface_Datatrove_DiskWriter
