Implementation: HuggingFace Datasets ParquetDatasetReader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete reader for importing Apache Parquet columnar files into the Dataset format provided by the HuggingFace Datasets library.
Description
ParquetDatasetReader is a reader class that extends AbstractDatasetReader and uses the packaged Parquet builder to read one or more Parquet files into an Arrow-backed Dataset or IterableDataset. Because Parquet and Arrow share a columnar memory model, the import is highly efficient. The reader supports configurable features, caching, in-memory loading, streaming mode, and multiprocessing. All additional keyword arguments are forwarded to the underlying Parquet builder.
Usage
Use ParquetDatasetReader when you need to programmatically load Parquet files into a HuggingFace Dataset. It is typically invoked indirectly via Dataset.from_parquet() or load_dataset("parquet", ...), but can also be instantiated directly.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/io/parquet.py
- Lines: L19-L72
Signature
class ParquetDatasetReader(AbstractDatasetReader):
def __init__(
self,
path_or_paths: NestedDataStructureLike[PathLike],
split: Optional[NamedSplit] = None,
features: Optional[Features] = None,
cache_dir: str = None,
keep_in_memory: bool = False,
streaming: bool = False,
num_proc: Optional[int] = None,
**kwargs,
):
def read(self):
Import
from datasets.io.parquet import ParquetDatasetReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_paths | NestedDataStructureLike[PathLike] | Yes | Path(s) to Parquet file(s). Can be a single path, a list of paths, or a dict mapping split names to paths. |
| split | Optional[NamedSplit] | No | Name of the dataset split to assign to the loaded data. |
| features | Optional[Features] | No | Explicit schema to apply instead of reading from the Parquet metadata. |
| cache_dir | str | No | Directory for caching the processed dataset. |
| keep_in_memory | bool | No | Whether to keep the dataset in memory instead of memory-mapping. Defaults to False. |
| streaming | bool | No | If True, returns an IterableDataset for streaming access. Defaults to False. |
| num_proc | Optional[int] | No | Number of processes for parallel dataset preparation. |
| **kwargs | Any | No | Additional keyword arguments forwarded to the Parquet builder. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset or IterableDataset | The loaded dataset, either map-style or iterable depending on the streaming parameter. |
Usage Examples
Basic Usage
from datasets.io.parquet import ParquetDatasetReader
# Load a single Parquet file
reader = ParquetDatasetReader("data/train.parquet", split="train")
dataset = reader.read()
# Load multiple Parquet files
reader = ParquetDatasetReader(["data/part-0.parquet", "data/part-1.parquet"], split="train")
dataset = reader.read()
# Load with streaming
reader = ParquetDatasetReader("data/train.parquet", streaming=True)
iterable_dataset = reader.read()