Implementation: Huggingface Datatrove ParquetWriter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Data_Output, Columnar_Storage | 2026-02-14 |
Overview
Concrete writer class that serializes Document objects to Apache Parquet files using pyarrow, with batched writes and HuggingFace Hub-optimized settings.
Description
ParquetWriter extends DiskWriter to produce Parquet output. Documents are accumulated in in-memory batches (default 1000 rows) before being flushed as RecordBatch objects to a pyarrow.parquet.ParquetWriter. This batching reduces the number of I/O operations and produces well-structured row groups. The writer supports automatic file splitting at a configurable maximum file size (default 5 GB), schema inference from the first document or an explicit schema, and multiple compression codecs.
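The accumulate-and-flush behavior can be sketched in plain Python. This is a simplified model of the real writer, not its implementation: `flush` here stands in for converting the batch to a RecordBatch and writing it as a row group.

```python
class BatchingSketch:
    """Simplified model of ParquetWriter's batch-and-flush loop (illustrative only)."""

    def __init__(self, batch_size: int = 1000):
        self.batch_size = batch_size
        self.batch = []    # rows accumulated in memory
        self.flushed = []  # stands in for row groups written to disk

    def write(self, row: dict) -> None:
        self.batch.append(row)
        if len(self.batch) == self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Emit whatever has accumulated as one "row group" and reset the buffer.
        if self.batch:
            self.flushed.append(list(self.batch))
            self.batch.clear()

w = BatchingSketch(batch_size=2)
for i in range(5):
    w.write({"text": f"doc {i}"})
w.flush()  # a final flush on close drains the partial batch
# → three row groups of sizes 2, 2, 1
```

A larger `batch_size` means fewer, larger row groups, which generally improves scan performance at the cost of more memory held per task.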
Two HuggingFace-specific optimizations are enabled by default: content-defined chunking (CDC) with configurable min/max chunk sizes and normalization level, and page index writing for efficient random access. The CDC default options use a minimum chunk size of 256 KB, maximum of 1 MB, and normalization level 0.
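Since `use_content_defined_chunking` accepts a dict as well as a bool, the defaults above could in principle be overridden explicitly. The key names below mirror pyarrow's CDC options and are an assumption here; verify them against your installed pyarrow and datatrove versions.

```python
# Assumed CDC option keys (mirroring pyarrow's content-defined chunking
# options) -- confirm against your pyarrow/datatrove versions before use.
cdc_options = {
    "min_chunk_size": 256 * 1024,   # 256 KB, the documented default
    "max_chunk_size": 1024 * 1024,  # 1 MB, the documented default
    "norm_level": 0,                # normalization level, the documented default
}

# writer = ParquetWriter(
#     output_folder="/data/output",
#     use_content_defined_chunking=cdc_options,
# )
```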
Usage
Place as the final step in a datatrove pipeline to produce Parquet output files suitable for upload to the HuggingFace Hub or consumption by columnar query engines.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/writers/parquet.py (L11-106)
Signature:
class ParquetWriter(DiskWriter):
    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        compression: Literal["snappy", "gzip", "brotli", "lz4", "zstd"] | None = "snappy",
        adapter: Callable = None,
        batch_size: int = 1000,
        expand_metadata: bool = False,
        max_file_size: int = 5 * 2**30,  # 5GB
        schema: Any = None,
        save_media_bytes=False,
        use_content_defined_chunking=True,
        write_page_index=True,
    ):
Import:
from datatrove.pipeline.writers import ParquetWriter
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Local path, remote URI, or DataFolder where Parquet files are written |
| output_filename | str | No | Filename template with placeholders (default ${rank}.parquet) |
| compression | Literal or None | No | Compression codec: "snappy", "gzip", "brotli", "lz4", "zstd", or None (default "snappy") |
| adapter | Callable | No | Custom function to transform Document to output dict |
| batch_size | int | No | Number of rows to accumulate before flushing a batch (default 1000) |
| expand_metadata | bool | No | If True, flatten metadata keys to top-level columns (default False) |
| max_file_size | int | No | Max bytes per file before splitting (default 5 GB) |
| schema | Any | No | Explicit PyArrow schema; if None, inferred from first document |
| save_media_bytes | bool | No | If True, include media bytes in output (default False) |
| use_content_defined_chunking | bool or dict | No | Enable CDC for HuggingFace-optimized files (default True) |
| write_page_index | bool | No | Write page indexes for efficient random access (default True) |
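The `${rank}` placeholder in `output_filename` follows standard Python string-template syntax. A minimal sketch of how a per-rank filename is produced (the real writer does this internally, and typically zero-pads the rank):

```python
from string import Template

output_filename = "${rank}.parquet"  # the default template
# Each pipeline task substitutes its own rank into the template.
filename = Template(output_filename).substitute(rank="00042")
# → "00042.parquet"
```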
Outputs:
- Parquet files on disk or remote storage with text and metadata columns
- Each file contains batched row groups with the configured compression codec
Usage Examples
Example 1 -- Basic Parquet output with snappy compression:
from datatrove.pipeline.writers import ParquetWriter
writer = ParquetWriter(
output_folder="s3://my-bucket/output/",
compression="snappy",
)
Example 2 -- Custom batch size with zstd compression and explicit schema:
import pyarrow as pa
from datatrove.pipeline.writers import ParquetWriter
schema = pa.schema([
("text", pa.string()),
("id", pa.string()),
("url", pa.string()),
])
writer = ParquetWriter(
output_folder="/data/output",
batch_size=5000,
compression="zstd",
schema=schema,
expand_metadata=True,
)