Implementation: Huggingface Datatrove ParquetWriter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Data_Output, Columnar_Storage | 2026-02-14 |
Overview
Concrete writer class that serializes Document objects to Apache Parquet files using pyarrow, with batched writes and HuggingFace Hub-optimized settings.
Description
ParquetWriter extends DiskWriter to produce Parquet output. Documents are accumulated in in-memory batches (default 1000 rows) before being flushed as RecordBatch objects to a pyarrow.parquet.ParquetWriter. This batching reduces the number of I/O operations and produces well-structured row groups. The writer supports automatic file splitting at a configurable maximum file size (default 5 GB), schema inference from the first document or an explicit schema, and multiple compression codecs.
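The accumulate-and-flush behavior can be sketched in plain Python. This is a simplified model of the real writer, not its implementation: `flush` here stands in for converting the batch to a RecordBatch and writing it as a row group.

```python
class BatchingSketch:
    """Simplified model of ParquetWriter's batch-and-flush loop (illustrative only)."""

    def __init__(self, batch_size: int = 1000):
        self.batch_size = batch_size
        self.batch = []    # rows accumulated in memory
        self.flushed = []  # stands in for row groups written to disk

    def write(self, row: dict) -> None:
        self.batch.append(row)
        if len(self.batch) == self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Emit whatever has accumulated as one "row group" and reset the buffer.
        if self.batch:
            self.flushed.append(list(self.batch))
            self.batch.clear()

w = BatchingSketch(batch_size=2)
for i in range(5):
    w.write({"text": f"doc {i}"})
w.flush()  # a final flush on close drains the partial batch
# → three row groups of sizes 2, 2, 1
```

A larger `batch_size` means fewer, larger row groups, which generally improves scan performance at the cost of more memory held per task.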
Two HuggingFace-specific optimizations are enabled by default: content-defined chunking (CDC) with configurable min/max chunk sizes and normalization level, and page index writing for efficient random access. The CDC default options use a minimum chunk size of 256 KB, maximum of 1 MB, and normalization level 0.
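Since `use_content_defined_chunking` accepts a dict as well as a bool, the defaults above could in principle be overridden explicitly. The key names below mirror pyarrow's CDC options and are an assumption here; verify them against your installed pyarrow and datatrove versions.

```python
# Assumed CDC option keys (mirroring pyarrow's content-defined chunking
# options) -- confirm against your pyarrow/datatrove versions before use.
cdc_options = {
    "min_chunk_size": 256 * 1024,   # 256 KB, the documented default
    "max_chunk_size": 1024 * 1024,  # 1 MB, the documented default
    "norm_level": 0,                # normalization level, the documented default
}

# writer = ParquetWriter(
#     output_folder="/data/output",
#     use_content_defined_chunking=cdc_options,
# )
```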
Usage
Place as the final step in a datatrove pipeline to produce Parquet output files suitable for upload to the HuggingFace Hub or consumption by columnar query engines.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/writers/parquet.py (L11-106)
Signature:
class ParquetWriter(DiskWriter):
    def __init__(
        self,
        output_folder: DataFolderLike,
        output_filename: str = None,
        compression: Literal["snappy", "gzip", "brotli", "lz4", "zstd"] | None = "snappy",
        adapter: Callable = None,
        batch_size: int = 1000,
        expand_metadata: bool = False,
        max_file_size: int = 5 * 2**30,  # 5GB
        schema: Any = None,
        save_media_bytes=False,
        use_content_defined_chunking=True,
        write_page_index=True,
    ):
Import:
from datatrove.pipeline.writers import ParquetWriter
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Local path, remote URI, or DataFolder where Parquet files are written |
| output_filename | str | No | Filename template with placeholders (default ${rank}.parquet) |
| compression | Literal or None | No | Compression codec: "snappy", "gzip", "brotli", "lz4", "zstd", or None (default "snappy") |
| adapter | Callable | No | Custom function to transform Document to output dict |
| batch_size | int | No | Number of rows to accumulate before flushing a batch (default 1000) |
| expand_metadata | bool | No | If True, flatten metadata keys to top-level columns (default False) |
| max_file_size | int | No | Max bytes per file before splitting (default 5 GB) |
| schema | Any | No | Explicit PyArrow schema; if None, inferred from first document |
| save_media_bytes | bool | No | If True, include media bytes in output (default False) |
| use_content_defined_chunking | bool or dict | No | Enable CDC for HuggingFace-optimized files (default True) |
| write_page_index | bool | No | Write page indexes for efficient random access (default True) |
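The `${rank}` placeholder in `output_filename` follows standard Python string-template syntax. A minimal sketch of how a per-rank filename is produced (the real writer does this internally, and typically zero-pads the rank):

```python
from string import Template

output_filename = "${rank}.parquet"  # the default template
# Each pipeline task substitutes its own rank into the template.
filename = Template(output_filename).substitute(rank="00042")
# → "00042.parquet"
```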
Outputs:
- Parquet files on disk or remote storage with text and metadata columns
- Each file contains batched row groups with the configured compression codec
Usage Examples
Example 1 -- Basic Parquet output with snappy compression:
from datatrove.pipeline.writers import ParquetWriter
writer = ParquetWriter(
output_folder="s3://my-bucket/output/",
compression="snappy",
)
Example 2 -- Custom batch size with zstd compression and explicit schema:
import pyarrow as pa
from datatrove.pipeline.writers import ParquetWriter
schema = pa.schema([
("text", pa.string()),
("id", pa.string()),
("url", pa.string()),
])
writer = ParquetWriter(
output_folder="/data/output",
batch_size=5000,
compression="zstd",
schema=schema,
expand_metadata=True,
)