Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Huggingface Datatrove DiskWriter

From Leeroopedia
Knowledge Sources
Domains Data Processing, File I/O
Last Updated 2026-02-14 17:00 GMT

Overview

DiskWriter is an abstract base class that provides the framework for writing pipeline documents to disk in various formats, with support for templated filenames, compression, custom adapters, and automatic file splitting.

Description

The DiskWriter class extends both PipelineStep and ABC (Abstract Base Class) to serve as the foundation for all disk-based writer implementations in Datatrove. It handles the common concerns of file output management, including output folder configuration, filename templating with variable substitution (supporting `${rank}`, `${id}`, and metadata fields), compression (gzip, zstd, or inferred from extension), and file size management with automatic splitting when a configurable maximum file size is exceeded.

The class provides a pluggable adapter mechanism that transforms Document objects into dictionaries suitable for the specific output format. The default adapter converts the Document dataclass to a dictionary, optionally expanding nested metadata into top-level keys when expand_metadata is enabled. Custom adapters can be provided as callable functions. The class also supports controlling whether media bytes are included in the output through the save_media_bytes parameter.

File management is handled through an output file manager obtained from the DataFolder, which supports concurrent access to multiple output files. When max_file_size is set (only in binary write mode), the writer automatically switches to a new file with an incremented counter prefix (e.g., `000_`, `001_`) when the current file exceeds the limit. Subclasses must implement the abstract _write method to define format-specific serialization logic.

Usage

Do not instantiate DiskWriter directly. Instead, use one of its concrete subclasses (such as JsonlWriter, ParquetWriter, or CSVWriter) to write documents in specific formats. Extend this class when implementing a new output format by providing a _write method.

Code Reference

Source Location

Signature

class DiskWriter(PipelineStep, ABC):
    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        compression: str | None = "infer",
        adapter: Callable = None,
        mode: str = "wt",
        expand_metadata: bool = False,
        max_file_size: int = -1,
        save_media_bytes: bool = False,
    ):

Import

from datatrove.pipeline.writers.disk_base import DiskWriter

I/O Contract

Inputs

Name Type Required Description
output_folder DataFolderLike Yes A string, tuple, or DataFolder specifying where data should be saved
output_filename str No Filename template with placeholders like `${rank}` or metadata tags (default: subclass-defined)
compression str or None No Compression scheme: "gzip", "zstd", "infer", or None (default: "infer")
adapter Callable No Custom function to transform Document objects to output dictionaries (default: None)
mode str No File open mode, e.g., "wt" for text or "wb" for binary (default: "wt")
expand_metadata bool No Whether to expand metadata dict into top-level keys (default: False)
max_file_size int No Maximum file size in bytes before splitting; -1 for unlimited (default: -1)
save_media_bytes bool No Whether to include media bytes in the output (default: False)

Outputs

Name Type Description
data files Various formats Output files written to the specified output_folder in the format defined by the subclass
DocumentsPipeline Generator Documents are yielded through after writing, enabling pipeline chaining

Usage Examples

Basic Usage

# DiskWriter is abstract; use a concrete subclass such as JsonlWriter
from datatrove.pipeline.writers import JsonlWriter

writer = JsonlWriter(
    output_folder="output/data/",
    output_filename="${rank}.jsonl.gz",
    compression="gzip",
    expand_metadata=True,
)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment