Implementation:Huggingface Datatrove DiskWriter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, File I/O |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
DiskWriter is an abstract base class that provides the framework for writing pipeline documents to disk in various formats, with support for templated filenames, compression, custom adapters, and automatic file splitting.
Description
The DiskWriter class extends both PipelineStep and ABC (Abstract Base Class) to serve as the foundation for all disk-based writer implementations in Datatrove. It handles the common concerns of file output management, including output folder configuration, filename templating with variable substitution (supporting `${rank}`, `${id}`, and metadata fields), compression (gzip, zstd, or inferred from extension), and file size management with automatic splitting when a configurable maximum file size is exceeded.
The class provides a pluggable adapter mechanism that transforms Document objects into dictionaries suitable for the specific output format. The default adapter converts the Document dataclass to a dictionary, optionally expanding nested metadata into top-level keys when expand_metadata is enabled. Custom adapters can be provided as callable functions. The class also supports controlling whether media bytes are included in the output through the save_media_bytes parameter.
File management is handled through an output file manager obtained from the DataFolder, which supports concurrent access to multiple output files. When max_file_size is set (only in binary write mode), the writer automatically switches to a new file with an incremented counter prefix (e.g., `000_`, `001_`) when the current file exceeds the limit. Subclasses must implement the abstract _write method to define format-specific serialization logic.
Usage
Do not instantiate DiskWriter directly. Instead, use one of its concrete subclasses (such as JsonlWriter, ParquetWriter, or CSVWriter) to write documents in specific formats. Extend this class when implementing a new output format by providing a _write method.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/writers/disk_base.py
- Lines: 1-208
Signature
class DiskWriter(PipelineStep, ABC):
def __init__(
self,
output_folder: DataFolderLike,
output_filename: str = None,
compression: str | None = "infer",
adapter: Callable = None,
mode: str = "wt",
expand_metadata: bool = False,
max_file_size: int = -1,
save_media_bytes: bool = False,
):
Import
from datatrove.pipeline.writers.disk_base import DiskWriter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | A string, tuple, or DataFolder specifying where data should be saved |
| output_filename | str | No | Filename template with placeholders like `${rank}` or metadata tags (default: subclass-defined) |
| compression | str or None | No | Compression scheme: "gzip", "zstd", "infer", or None (default: "infer") |
| adapter | Callable | No | Custom function to transform Document objects to output dictionaries (default: None) |
| mode | str | No | File open mode, e.g., "wt" for text or "wb" for binary (default: "wt") |
| expand_metadata | bool | No | Whether to expand metadata dict into top-level keys (default: False) |
| max_file_size | int | No | Maximum file size in bytes before splitting; -1 for unlimited (default: -1) |
| save_media_bytes | bool | No | Whether to include media bytes in the output (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| data files | Various formats | Output files written to the specified output_folder in the format defined by the subclass |
| DocumentsPipeline | Generator | Documents are yielded through after writing, enabling pipeline chaining |
Usage Examples
Basic Usage
# DiskWriter is abstract; use a concrete subclass such as JsonlWriter
from datatrove.pipeline.writers import JsonlWriter
writer = JsonlWriter(
output_folder="output/data/",
output_filename="${rank}.jsonl.gz",
compression="gzip",
expand_metadata=True,
)