Implementation: HuggingFace Datasets ParquetDatasetReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete reader for importing Apache Parquet columnar files into the Dataset format provided by the HuggingFace Datasets library.
Description
ParquetDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Parquet builder to read one or more Parquet files into an Arrow-backed Dataset or IterableDataset. Because Parquet and Arrow share a columnar memory model, the import is highly efficient. The reader supports configurable features, caching, in-memory loading, streaming mode, and multiprocessing. All additional keyword arguments are forwarded to the underlying Parquet builder.
Usage
Use ParquetDatasetReader when you need to programmatically load Parquet files into a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_parquet() or load_dataset("parquet", ...), but can also be instantiated directly.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/io/parquet.py
- Lines: L19-L72
Signature
class ParquetDatasetReader(AbstractDatasetReader):
def __init__(
self,
path_or_paths: NestedDataStructureLike[PathLike],
split: Optional[NamedSplit] = None,
features: Optional[Features] = None,
cache_dir: str = None,
keep_in_memory: bool = False,
streaming: bool = False,
num_proc: Optional[int] = None,
**kwargs,
):
def read(self):
Import
from datasets.io.parquet import ParquetDatasetReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_paths | NestedDataStructureLike[PathLike] | Yes | Path(s) to Parquet file(s). Can be a single path, a list of paths, or a dict mapping split names to paths. |
| split | Optional[NamedSplit] | No | Name of the dataset split to assign to the loaded data. |
| features | Optional[Features] | No | Explicit schema to apply instead of reading from the Parquet metadata. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory instead of memory-mapping. Defaults to False. |
| streaming | bool | No | If True, returns an IterableDataset for streaming access. Defaults to False. |
| num_proc | Optional[int] | No | Number of processes for parallel dataset preparation. |
| **kwargs | Any | No | Additional keyword arguments forwarded to the Parquet builder. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset or IterableDataset | The loaded dataset, either map-style or iterable depending on the streaming parameter. |
Usage Examples
Basic Usage
from datasets.io.parquet import ParquetDatasetReader
# Load a single Parquet file
reader = ParquetDatasetReader("data/train.parquet", split="train")
dataset = reader.read()
# Load multiple Parquet files
reader = ParquetDatasetReader(["data/part-0.parquet", "data/part-1.parquet"], split="train")
dataset = reader.read()
# Load with streaming
reader = ParquetDatasetReader("data/train.parquet", streaming=True)
iterable_dataset = reader.read()