Implementation:Huggingface Datatrove ParquetReader

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, ETL
Last Updated	2026-02-14 17:00 GMT

Overview

ParquetReader is a pipeline reader that reads data from Apache Parquet files in configurable batch sizes, converting rows into Document objects.

Description

ParquetReader extends BaseDiskReader to support ingestion of Parquet files, a widely used columnar storage format in data engineering and machine learning. The reader uses pyarrow.parquet.ParquetFile to open files and iterates over record batches using iter_batches with a configurable batch_size parameter (defaulting to 1000 rows per batch).

The read_metadata option controls which columns are read. When it is True (the default), all columns are read and their values become part of the document metadata. When it is False, the reader requests only the text_key and id_key columns. Because Parquet is a columnar format, skipping unneeded columns reduces the amount of data that must be read and decoded, which helps most when files contain many columns but only the text and ID are needed for processing.

Each batch is converted to a list of row dictionaries via batch.to_pylist(), and each row dictionary is then transformed into a Document using the inherited get_document_from_dict method. The class requires pyarrow as a dependency.
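The row-to-document step can be pictured with the sketch below. It does not reproduce get_document_from_dict: SimpleDocument and row_to_document are hypothetical stand-ins, and the fallback ID format and the skipping of rows without text are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for datatrove's Document, for illustration only."""

    text: str
    id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def row_to_document(
    row: dict, path: str, index: int, text_key: str = "text", id_key: str = "id"
) -> Optional[SimpleDocument]:
    """Map one row dict to a document; remaining columns become metadata."""
    data = dict(row)  # copy so the caller's row is not mutated
    text = data.pop(text_key, "")
    if not text:
        return None  # assumption: rows without text are skipped
    # Assumption: a missing id falls back to "<path>/<row index>".
    doc_id = data.pop(id_key, f"{path}/{index}")
    return SimpleDocument(text=text, id=doc_id, metadata=data)


rows = [
    {"id": "doc-1", "text": "hello", "url": "https://example.com"},
    {"id": "doc-2", "text": ""},
    {"text": "no id here"},
]
documents = []
for index, row in enumerate(rows):
    document = row_to_document(row, "data/sample.parquet", index)
    if document is not None:
        documents.append(document)
```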

Usage

Use ParquetReader when your data is stored in Apache Parquet format, which is widely used for large-scale datasets in the Hugging Face ecosystem and in many data lake architectures. It suits datasets exported from the Hugging Face Hub, data warehouses, or any system that outputs Parquet files.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/readers/parquet.py
Lines: 1-87

Signature

class ParquetReader(BaseDiskReader):
    name = "📒 Parquet"
    _requires_dependencies = ["pyarrow"]

    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        limit: int = -1,
        skip: int = 0,
        batch_size: int = 1000,
        read_metadata: bool = True,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    )

Import

from datatrove.pipeline.readers.parquet import ParquetReader

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	DataFolderLike	Yes	Path or filesystem object pointing to the folder containing Parquet files
paths_file	DataFileLike or None	No	Optional file listing specific paths to read (one per line)
limit	int	No	Maximum number of documents to read; -1 means no limit
skip	int	No	Number of initial rows to skip
batch_size	int	No	Number of rows to read per batch (default: 1000)
read_metadata	bool	No	If True, reads all columns; if False, reads only text_key and id_key columns (default: True)
file_progress	bool	No	Whether to show a progress bar for files
doc_progress	bool	No	Whether to show a progress bar for documents
adapter	Callable	No	Custom function to transform raw data dicts into Document-compatible dicts
text_key	str	No	Column name containing document text (default: "text")
id_key	str	No	Column name containing document ID (default: "id")
default_metadata	dict	No	Default metadata added to all documents
recursive	bool	No	Whether to search for files recursively (default: True)
glob_pattern	str or None	No	Glob pattern to filter which files are included
shuffle_files	bool	No	Whether to shuffle files within the returned shard
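As an illustration of the adapter parameter, the sketch below builds a document dict from columns that do not match the default text_key and id_key. It rests on assumptions that should be checked against the installed datatrove version: that the adapter is called with the reader instance as its first argument followed by the row dict, the file path, and the row index, and that it returns a dict with text, id, media, and metadata keys. The column names title, body, and doc_id are invented for the example.

```python
def title_body_adapter(self, data: dict, path: str, id_in_file) -> dict:
    """Hypothetical adapter: join `title` and `body` columns into the text.

    Assumed call signature (reader, row dict, file path, row index);
    verify against the datatrove version in use.
    """
    title = data.pop("title", "")
    body = data.pop("body", "")
    return {
        "text": f"{title}\n\n{body}".strip(),
        # Fall back to a path-based id when the row has no `doc_id` column.
        "id": data.pop("doc_id", f"{path}/{id_in_file}"),
        "media": [],
        # Whatever columns remain are carried along as metadata.
        "metadata": data,
    }


# Exercise the function directly, passing None in place of the reader instance.
sample_row = {"doc_id": "42", "title": "Title", "body": "Body text", "lang": "en"}
adapted = title_body_adapter(None, sample_row, "data/sample.parquet", 0)
```

Under these assumptions, the function would be passed as adapter=title_body_adapter when constructing the reader.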

Outputs

Name	Type	Description
documents	Generator[Document]	Yields Document objects extracted from Parquet row batches

Usage Examples

Basic Usage

from datatrove.pipeline.readers.parquet import ParquetReader

# Read Parquet files from a local directory
reader = ParquetReader(
    data_folder="path/to/parquet/files",
    text_key="text",
    id_key="id",
)

Optimized Column Reading

from datatrove.pipeline.readers.parquet import ParquetReader

# Read only text and ID columns for faster processing
reader = ParquetReader(
    data_folder="s3://my-bucket/dataset/",
    read_metadata=False,
    batch_size=5000,
    glob_pattern="*.parquet",
)

Related Pages

Principle:Huggingface_Datatrove_Parquet_Data_Reading
