Implementation:Datajuicer Data juicer DatasetBuilder Load Dataset

Knowledge Sources	Data-Juicer HuggingFace Datasets
Domains	Data_Engineering, ETL
Last Updated	2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for loading datasets from multiple source types into a unified DJDataset.

Description

The DatasetBuilder class selects a loading strategy based on the executor type and dataset path, then loads data into either a NestedDataset (HuggingFace Dataset wrapper) or RayDataset (Ray Dataset wrapper). It supports local files (JSONL, Parquet, CSV, JSON), HuggingFace Hub paths, and S3 remote sources. Multiple datasets can be concatenated with configurable weights.
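To make the idea of weighted concatenation concrete, here is a minimal sketch in plain Python. It assumes that a weight is a proportional share of a total sample budget; this page does not state how Data-Juicer actually interprets weights, so treat that as an assumption. The helper `mix_by_weight` and its `max_samples` and `seed` parameters are hypothetical and are not part of the Data-Juicer API.

```python
import random
from typing import Sequence


def mix_by_weight(
    datasets: Sequence[list],
    weights: Sequence[float],
    max_samples: int,
    seed: int = 0,
) -> list:
    """Hypothetical helper: sample each dataset in proportion to its weight,
    then concatenate the samples in the order the datasets were given."""
    if len(datasets) != len(weights):
        raise ValueError("need exactly one weight per dataset")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = []
    for data, weight in zip(datasets, weights):
        # Proportional share of the budget, capped by the dataset's size.
        quota = min(len(data), int(max_samples * weight / total))
        mixed.extend(rng.sample(data, quota))
    return mixed


# Two toy "datasets" standing in for loaded sources.
web = [{"text": f"web-{i}"} for i in range(100)]
code = [{"text": f"code-{i}"} for i in range(100)]

# Weights 3:1 over a budget of 40 samples.
mixed = mix_by_weight([web, code], weights=[3.0, 1.0], max_samples=40)
```

With weights of 3.0 and 1.0 and a budget of 40, the sketch draws 30 samples from the first list and 10 from the second. A real loader would operate on dataset objects rather than Python lists.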

Usage

Use this class after calling init_configs to load the dataset specified in the configuration. The builder automatically selects the loading strategy that matches executor_type ('default' or 'ray').
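The selection step can be pictured as a lookup keyed on the executor type and the kind of source. The sketch below is only an illustration of that idea: the function `pick_strategy`, the suffix set, and the rule that anything which is neither an `s3://` URL nor a known file suffix counts as a HuggingFace Hub path are all assumptions, not the rules the real DatasetBuilder applies.

```python
from pathlib import PurePosixPath
from typing import Tuple

# File types named on this page as supported local formats.
LOCAL_SUFFIXES = {".jsonl", ".json", ".parquet", ".csv"}


def pick_strategy(executor_type: str, path: str) -> Tuple[str, str]:
    """Hypothetical helper: return an (executor_type, source_kind) key
    that a registry of loading strategies could be indexed by."""
    if executor_type not in ("default", "ray"):
        raise ValueError(f"unknown executor_type: {executor_type!r}")

    if path.startswith("s3://"):
        source_kind = "s3"
    elif PurePosixPath(path).suffix.lower() in LOCAL_SUFFIXES:
        source_kind = "local"
    else:
        # Assumption: everything else is treated as a Hub dataset name.
        source_kind = "huggingface"
    return executor_type, source_kind


# Placeholder paths, used only to exercise the three branches.
local_key = pick_strategy("default", "data/train.jsonl")
s3_key = pick_strategy("ray", "s3://some-bucket/data.parquet")
hub_key = pick_strategy("default", "some-org/some-dataset")
```

A heuristic this simple would misclassify a local directory as a Hub path, which is one reason an explicit source type in the configuration can be preferable to guessing from the path.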

Code Reference

Source Location

Repository: data-juicer
File: data_juicer/core/data/dataset_builder.py
Lines: L18-160

Signature

class DatasetBuilder:
    def __init__(self, cfg: Namespace, executor_type: str = 'default'):
        """
        Args:
            cfg: Parsed config namespace with dataset_path or dataset dict.
            executor_type: 'default' for HuggingFace, 'ray' for distributed.
        """

    def load_dataset(self, **kwargs) -> DJDataset:
        """
        Load dataset using the selected strategy.

        Args:
            **kwargs: Passed to the loading strategy (e.g. num_proc).

        Returns:
            NestedDataset or RayDataset depending on executor_type.
        """

Import

from data_juicer.core.data import DatasetBuilder

I/O Contract

Inputs

Name	Type	Required	Description
cfg	Namespace	Yes	Config with dataset_path or dataset dict
executor_type	str	No	'default' or 'ray' (default: 'default')
**kwargs	dict	No	Extra args passed to loading strategy (e.g. num_proc)
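The two accepted shapes of cfg, a single dataset_path or a dataset dict, can be illustrated with a small normalizing helper. This is a sketch under stated assumptions: `argparse.Namespace` stands in for the parsed config namespace, the helper `dataset_sources` is hypothetical, and the key names `configs`, `type`, `path`, and `weight` are assumed for illustration and should be checked against the Data-Juicer configuration documentation.

```python
from argparse import Namespace


def dataset_sources(cfg: Namespace) -> list:
    """Hypothetical helper: reduce either config form to a list of
    per-source dicts so later code handles one shape only."""
    dataset = getattr(cfg, "dataset", None)
    if dataset:
        # Assumed layout: {"configs": [{"type": ..., "path": ..., "weight": ...}]}
        return list(dataset["configs"])

    path = getattr(cfg, "dataset_path", None)
    if path:
        return [{"path": path}]

    raise ValueError("cfg must define dataset_path or dataset")


# Form 1: a single path.
simple_cfg = Namespace(dataset_path="data/train.jsonl")

# Form 2: a dict describing several weighted sources (assumed key names).
dict_cfg = Namespace(
    dataset={
        "configs": [
            {"type": "local", "path": "data/a.jsonl", "weight": 2.0},
            {"type": "local", "path": "data/b.jsonl", "weight": 1.0},
        ]
    }
)

simple_sources = dataset_sources(simple_cfg)
dict_sources = dataset_sources(dict_cfg)
```

In this sketch the dataset dict takes precedence when both fields are present; whether the real builder resolves the conflict the same way is not stated on this page.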

Outputs

Name	Type	Description
dataset	NestedDataset or RayDataset	Loaded dataset ready for operator processing

Usage Examples

Basic Dataset Loading

from data_juicer.config import init_configs
from data_juicer.core.data import DatasetBuilder

cfg = init_configs(args=['--config', 'pipeline.yaml'])
builder = DatasetBuilder(cfg, executor_type='default')
dataset = builder.load_dataset()

print(len(dataset))  # Number of samples
print(dataset[0])    # First sample dict

Ray Distributed Loading

from data_juicer.config import init_configs
from data_juicer.core.data import DatasetBuilder

cfg = init_configs(args=['--config', 'ray_pipeline.yaml'])
builder = DatasetBuilder(cfg, executor_type='ray')
dataset = builder.load_dataset()
# Returns RayDataset wrapping ray.data.Dataset

Related Pages

Implements Principle

Principle:Datajuicer_Data_juicer_Dataset_Loading

Requires Environment

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
