
Implementation:Huggingface Datasets ParquetDatasetReader

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete reader, provided by the HuggingFace Datasets library, for importing Apache Parquet columnar files into the HuggingFace Dataset format.

Description

ParquetDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Parquet builder to read one or more Parquet files into an Arrow-backed Dataset or IterableDataset. Because Parquet and Arrow share a columnar memory model, the import is highly efficient. The reader supports configurable features, caching, in-memory loading, streaming mode, and multiprocessing. All additional keyword arguments are forwarded to the underlying Parquet builder.

Usage

Use ParquetDatasetReader when you need to programmatically load Parquet files into a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_parquet() or load_dataset("parquet", ...), but can also be instantiated directly.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/io/parquet.py
  • Lines: L19-L72

Signature

class ParquetDatasetReader(AbstractDatasetReader):
    def __init__(
        self,
        path_or_paths: NestedDataStructureLike[PathLike],
        split: Optional[NamedSplit] = None,
        features: Optional[Features] = None,
        cache_dir: str = None,
        keep_in_memory: bool = False,
        streaming: bool = False,
        num_proc: Optional[int] = None,
        **kwargs,
    ):

    def read(self):

Import

from datasets.io.parquet import ParquetDatasetReader

I/O Contract

Inputs

Name Type Required Description
path_or_paths NestedDataStructureLike[PathLike] Yes Path(s) to Parquet file(s). Can be a single path, a list of paths, or a dict mapping split names to paths.
split Optional[NamedSplit] No Name of the dataset split to assign to the loaded data.
features Optional[Features] No Explicit schema to apply instead of reading from the Parquet metadata.
cache_dir str No Directory for caching the processed dataset.
keep_in_memory bool No Whether to keep the dataset in memory instead of memory-mapping. Defaults to False.
streaming bool No If True, returns an IterableDataset for streaming access. Defaults to False.
num_proc Optional[int] No Number of processes for parallel dataset preparation.
**kwargs Any No Additional keyword arguments forwarded to the Parquet builder.

Outputs

Name Type Description
dataset Dataset or IterableDataset The loaded dataset, either map-style or iterable depending on the streaming parameter.

Usage Examples

Basic Usage

from datasets.io.parquet import ParquetDatasetReader

# Load a single Parquet file
reader = ParquetDatasetReader("data/train.parquet", split="train")
dataset = reader.read()

# Load multiple Parquet files
reader = ParquetDatasetReader(["data/part-0.parquet", "data/part-1.parquet"], split="train")
dataset = reader.read()

# Load with streaming
reader = ParquetDatasetReader("data/train.parquet", streaming=True)
iterable_dataset = reader.read()

Related Pages

Implements Principle
