Implementation:Huggingface Datasets Dataset To Parquet
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the Hugging Face Datasets library, for exporting a Dataset to the Apache Parquet columnar format.
Description
Dataset.to_parquet serializes the dataset's Arrow data to Parquet using PyArrow's ParquetWriter. Internally, it delegates to ParquetDatasetWriter, which writes Arrow record batches as Parquet row groups with per-column compression (Snappy for regular columns, uncompressed for embedded media). The writer supports content-defined chunking, which places data page boundaries according to content so that files suit deduplicating storage; page index generation for predicate pushdown; configurable batch sizes; remote storage via fsspec; and any keyword argument accepted by PyArrow's ParquetWriter.
Usage
Use Dataset.to_parquet when you need to persist a dataset in a compact, compressed columnar format for long-term storage, sharing on the Hugging Face Hub, or consumption by analytical engines (Spark, DuckDB, BigQuery, Polars).
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L5258-L5296
Signature
def to_parquet(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **parquet_writer_kwargs,
) -> int:
Import
from datasets import Dataset
# to_parquet is a method on Dataset instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_buf | Union[PathLike, BinaryIO] | Yes | Path to a file, a remote URI, or a binary file object where the Parquet data will be written. |
| batch_size | Optional[int] | No | Number of rows per row group. Defaults to an automatically computed value targeting ~100MB uncompressed row groups. |
| storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend. |
| **parquet_writer_kwargs | keyword arguments | No | Additional keyword arguments forwarded to pyarrow.parquet.ParquetWriter. |
Outputs
| Name | Type | Description |
|---|---|---|
| written | int | The number of bytes written. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("csv", data_files="input.csv", split="train")
# Export to Parquet
num_bytes = ds.to_parquet("output.parquet")
# Export to a remote location
num_bytes = ds.to_parquet(
    "hf://datasets/username/my_dataset/data.parquet",
    storage_options={"token": "hf_..."},
)
# Export with custom batch size
num_bytes = ds.to_parquet("output.parquet", batch_size=5000)