Implementation:Datajuicer Data juicer ParquetFormatter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Loading, Formatting
Last Updated	2026-02-14 16:00 GMT

Overview

A concrete Data-Juicer tool for loading Parquet files and formatting them as datasets.

Description

ParquetFormatter extends LocalFormatter, setting SUFFIXES = ['.parquet'] and type='parquet'. It delegates all loading logic to the parent class, which uses HuggingFace's load_dataset with the Parquet reader. The class is registered in the FORMATTERS registry via @FORMATTERS.register_module(). Parquet is an efficient columnar storage format commonly used for large-scale data processing.
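The registration-plus-delegation pattern described above can be illustrated with a minimal, self-contained sketch. It does not import Data-Juicer: Registry, LocalFormatterSketch, and ParquetFormatterSketch are hypothetical stand-ins, and the way the parent stores its arguments is an assumption made only for illustration.

```python
class Registry:
    """Tiny name -> class registry (hypothetical stand-in for FORMATTERS)."""

    def __init__(self, name):
        self.name = name
        self.modules = {}

    def register_module(self):
        # Return a decorator that records the class under its own name.
        def decorator(cls):
            self.modules[cls.__name__] = cls
            return cls

        return decorator


FORMATTERS = Registry("formatters")


class LocalFormatterSketch:
    """Hypothetical parent: it only stores what a real loader would need."""

    def __init__(self, dataset_path, type, suffixes=None, **kwargs):
        self.dataset_path = dataset_path
        self.type = type
        self.suffixes = suffixes
        self.kwargs = kwargs


@FORMATTERS.register_module()
class ParquetFormatterSketch(LocalFormatterSketch):
    SUFFIXES = [".parquet"]

    def __init__(self, dataset_path, suffixes=None, **kwargs):
        # Fall back to the class-level default when no suffixes are given,
        # and fix the loader type so the parent does all remaining work.
        super().__init__(
            dataset_path,
            type="parquet",
            suffixes=suffixes if suffixes else self.SUFFIXES,
            **kwargs,
        )


fmt = ParquetFormatterSketch("/path/to/data.parquet")
print(fmt.type, fmt.suffixes)
print(sorted(FORMATTERS.modules))
```

The subclass contributes only two defaults (the suffix list and the type string), which is why a format-specific formatter of this kind can stay so small.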

Usage

Use ParquetFormatter to load Parquet-formatted datasets into Data-Juicer for processing, either by instantiating it directly or through the automatic format detection system.
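One way to picture suffix-based format detection is the sketch below. It is an assumption about the general idea, not Data-Juicer's actual detection code: detect_formatter and SUFFIX_TO_FORMATTER are hypothetical names, only the '.parquet' entry comes from this page, and choosing the most frequent known suffix is an illustrative rule.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical suffix -> formatter-name table. Only the '.parquet' entry is
# taken from this page; the other entry is an illustrative placeholder.
SUFFIX_TO_FORMATTER = {
    ".parquet": "ParquetFormatter",
    ".csv": "PlaceholderCsvFormatter",
}


def detect_formatter(dataset_path):
    """Pick a formatter name from the most common known suffix under a path."""
    path = Path(dataset_path)
    # A single file is treated as a one-element dataset; a directory is
    # scanned recursively.
    files = [path] if path.is_file() else [p for p in path.rglob("*") if p.is_file()]
    counts = Counter(
        p.suffix.lower() for p in files if p.suffix.lower() in SUFFIX_TO_FORMATTER
    )
    if not counts:
        raise ValueError(f"no supported files found under {dataset_path}")
    suffix, _ = counts.most_common(1)[0]
    return SUFFIX_TO_FORMATTER[suffix]


with tempfile.TemporaryDirectory() as tmp:
    # Two Parquet files and one CSV file: Parquet is the majority suffix.
    for name in ("a.parquet", "b.parquet", "notes.csv"):
        (Path(tmp) / name).touch()
    chosen_for_dir = detect_formatter(tmp)
    chosen_for_file = detect_formatter(Path(tmp) / "a.parquet")

print(chosen_for_dir, chosen_for_file)
```

Direct instantiation skips this step entirely, which can be preferable when a directory mixes several file types and the intended format should be explicit.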

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/format/parquet_formatter.py

Signature

@FORMATTERS.register_module()
class ParquetFormatter(LocalFormatter):
    SUFFIXES = [".parquet"]

    def __init__(self, dataset_path, suffixes=None, **kwargs):

Import

from data_juicer.format.parquet_formatter import ParquetFormatter

I/O Contract

Inputs

Name	Type	Required	Description
dataset_path	str	Yes	Path to a Parquet dataset file or directory containing Parquet files
suffixes	list	No	File suffixes to be processed. Default: ['.parquet']
**kwargs	Any	No	Extra arguments passed to the parent LocalFormatter

Outputs

Name	Type	Description
dataset	Dataset	A unified HuggingFace Dataset loaded from the Parquet files

Usage Examples

from data_juicer.format.parquet_formatter import ParquetFormatter

# Load a Parquet dataset
formatter = ParquetFormatter(dataset_path="/path/to/data.parquet")
dataset = formatter.load_dataset(num_proc=4)

# Load from a directory of Parquet files
formatter = ParquetFormatter(dataset_path="/path/to/parquet_dir/")
dataset = formatter.load_dataset()

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
