Implementation:Huggingface Datasets Dataset To Parquet

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the Hugging Face Datasets library, for exporting a Dataset to the Apache Parquet columnar format.

Description

Dataset.to_parquet is a method that serializes the dataset's Arrow data to Parquet format using PyArrow's ParquetWriter. Internally, it delegates to ParquetDatasetWriter, which writes Arrow record batches as Parquet row groups with per-column compression (Snappy for regular columns, uncompressed for embedded media). The writer supports content-defined chunking for reproducible row group boundaries, page index generation for predicate pushdown, configurable batch sizes, remote storage via fsspec, and all PyArrow ParquetWriter keyword arguments.

Usage

Use Dataset.to_parquet when you need to persist a dataset in a compact, compressed columnar format for long-term storage, sharing on the Hugging Face Hub, or consumption by analytical engines (Spark, DuckDB, BigQuery, Polars).

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L5258-L5296

Signature

def to_parquet(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **parquet_writer_kwargs,
) -> int:

Import

from datasets import Dataset
# to_parquet is a method on Dataset instances

I/O Contract

Inputs

Name | Type | Required | Description
path_or_buf | Union[PathLike, BinaryIO] | Yes | Path to a file, a remote URI, or a binary file object where the Parquet data will be written.
batch_size | Optional[int] | No | Number of rows per row group. Defaults to an automatically computed value targeting ~100MB uncompressed row groups.
storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend.
**parquet_writer_kwargs | (keyword arguments) | No | Additional keyword arguments forwarded to pyarrow.parquet.ParquetWriter.

Outputs

Name | Type | Description
written | int | The number of bytes written.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("csv", data_files="input.csv", split="train")

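# Export to a local Parquet file; the return value is the number of bytes written
num_bytes = ds.to_parquet("output.parquet")

# Export to a remote location through fsspec (here, a dataset repository on the Hugging Face Hub)
num_bytes = ds.to_parquet(
    "hf://datasets/username/my_dataset/data.parquet",
    storage_options={"token": "hf_..."},
)

# Export with a custom batch size (rows per row group); this overwrites output.parquet
num_bytes = ds.to_parquet("output.parquet", batch_size=5000)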

Related Pages

Implements Principle
