Implementation:Huggingface Datatrove DiskWriter

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, File I/O
Last Updated	2026-02-14 17:00 GMT

Overview

DiskWriter is an abstract base class that provides the framework for writing pipeline documents to disk in various formats, with support for templated filenames, compression, custom adapters, and automatic file splitting.

Description

The DiskWriter class extends both PipelineStep and ABC (Abstract Base Class) to serve as the foundation for all disk-based writer implementations in Datatrove. It handles the common concerns of file output management, including output folder configuration, filename templating with variable substitution (supporting `${rank}`, `${id}`, and metadata fields), compression (gzip, zstd, or inferred from extension), and file size management with automatic splitting when a configurable maximum file size is exceeded.

The class provides a pluggable adapter mechanism that transforms Document objects into dictionaries suitable for the specific output format. The default adapter converts the Document dataclass to a dictionary, optionally expanding nested metadata into top-level keys when expand_metadata is enabled. Custom adapters can be provided as callable functions. The class also supports controlling whether media bytes are included in the output through the save_media_bytes parameter.

File management is handled through an output file manager obtained from the DataFolder, which supports concurrent access to multiple output files. When max_file_size is set (only in binary write mode), the writer automatically switches to a new file with an incremented counter prefix (e.g., `000_`, `001_`) when the current file exceeds the limit. Subclasses must implement the abstract _write method to define format-specific serialization logic.

Usage

Do not instantiate DiskWriter directly. Instead, use one of its concrete subclasses (such as JsonlWriter, ParquetWriter, or CSVWriter) to write documents in specific formats. Extend this class when implementing a new output format by providing a _write method.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/writers/disk_base.py
Lines: 1-208

Signature

class DiskWriter(PipelineStep, ABC):
    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        compression: str | None = "infer",
        adapter: Callable = None,
        mode: str = "wt",
        expand_metadata: bool = False,
        max_file_size: int = -1,
        save_media_bytes: bool = False,
    ):

Import

from datatrove.pipeline.writers.disk_base import DiskWriter

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	A string, tuple, or DataFolder specifying where data should be saved
output_filename	str	No	Filename template with placeholders like `${rank}` or metadata tags (default: subclass-defined)
compression	str or None	No	Compression scheme: "gzip", "zstd", "infer", or None (default: "infer")
adapter	Callable	No	Custom function to transform Document objects to output dictionaries (default: None)
mode	str	No	File open mode, e.g., "wt" for text or "wb" for binary (default: "wt")
expand_metadata	bool	No	Whether to expand metadata dict into top-level keys (default: False)
max_file_size	int	No	Maximum file size in bytes before splitting; -1 for unlimited (default: -1)
save_media_bytes	bool	No	Whether to include media bytes in the output (default: False)

Outputs

Name	Type	Description
data files	Various formats	Output files written to the specified output_folder in the format defined by the subclass
DocumentsPipeline	Generator	Documents are yielded through after writing, enabling pipeline chaining

Usage Examples

Basic Usage

# DiskWriter is abstract; use a concrete subclass such as JsonlWriter
from datatrove.pipeline.writers import JsonlWriter

writer = JsonlWriter(
    output_folder="output/data/",
    output_filename="${rank}.jsonl.gz",
    compression="gzip",
    expand_metadata=True,
)

Related Pages

Principle:Huggingface_Datatrove_Disk_Writing_Framework

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment