Implementation:Datajuicer Data juicer DatasetBuilder Load Dataset

Knowledge Sources
Domains: Data_Engineering, ETL
Last Updated: 2026-02-14 17:00 GMT

Overview

Concrete tool, provided by the Data-Juicer framework, for loading datasets from multiple source types into a unified DJDataset.

Description

The DatasetBuilder class selects a loading strategy based on the executor type and dataset path, then loads data into either a NestedDataset (HuggingFace Dataset wrapper) or RayDataset (Ray Dataset wrapper). It supports local files (JSONL, Parquet, CSV, JSON), HuggingFace Hub paths, and S3 remote sources. Multiple datasets can be concatenated with configurable weights.
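The weighted concatenation mentioned above can be pictured with a small standalone sketch. It does not use Data-Juicer itself: the function name mix_datasets is hypothetical, and reading a weight as "the fraction of each source to keep" is an assumption made for illustration, not a statement of how DatasetBuilder applies weights internally.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Dict[str, object]


def mix_datasets(
    sources: Sequence[Tuple[List[Sample], float]], seed: int = 42
) -> List[Sample]:
    """Concatenate sources, keeping round(weight * len(source)) samples of each.

    Hypothetical helper: the weight semantics here are an assumption.
    """
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    mixed: List[Sample] = []
    for samples, weight in sources:
        if weight < 0:
            raise ValueError("weight must be non-negative")
        # Never request more samples than the source contains.
        keep = min(len(samples), round(weight * len(samples)))
        mixed.extend(rng.sample(samples, keep))
    return mixed


web = [{"text": f"web-{i}"} for i in range(10)]
code = [{"text": f"code-{i}"} for i in range(4)]

# Keep half of the web samples and all of the code samples.
mixed = mix_datasets([(web, 0.5), (code, 1.0)])
print(len(mixed))
```

With these inputs the result holds five web samples and four code samples. A real pipeline would express the same intent through the configuration rather than by calling a helper like this.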

Usage

Use this class after calling init_configs to load the dataset specified in the configuration. The builder automatically selects the loading strategy that matches executor_type ('default' or 'ray').
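One way to picture strategy selection is a registry keyed by executor type and source type. The sketch below is an illustration only: the registry, the decorator, infer_source_type, and the loader functions are all hypothetical names, the path heuristics are assumptions, and the loaders return placeholder strings instead of real NestedDataset or RayDataset objects.

```python
from typing import Callable, Dict, Tuple

Loader = Callable[[str], str]

# Hypothetical registry: (executor_type, source_type) -> loader function.
_REGISTRY: Dict[Tuple[str, str], Loader] = {}


def register(executor_type: str, source_type: str) -> Callable[[Loader], Loader]:
    """Decorator that files a loader under its (executor, source) key."""
    def decorator(fn: Loader) -> Loader:
        _REGISTRY[(executor_type, source_type)] = fn
        return fn
    return decorator


def infer_source_type(path: str) -> str:
    """Guess the source type from the path (assumed heuristics)."""
    if path.startswith("s3://"):
        return "s3"
    if path.endswith((".jsonl", ".json", ".parquet", ".csv")):
        return "local"
    return "huggingface"


@register("default", "local")
def load_local_default(path: str) -> str:
    return f"NestedDataset<{path}>"  # placeholder for a real dataset object


@register("ray", "local")
def load_local_ray(path: str) -> str:
    return f"RayDataset<{path}>"  # placeholder for a real dataset object


@register("ray", "s3")
def load_s3_ray(path: str) -> str:
    return f"RayDataset<{path}>"  # placeholder for a real dataset object


def select_loader(executor_type: str, path: str) -> Loader:
    """Return the loader registered for this executor and path."""
    key = (executor_type, infer_source_type(path))
    try:
        return _REGISTRY[key]
    except KeyError:
        raise ValueError(f"no loading strategy registered for {key}") from None


loader = select_loader("ray", "s3://bucket/data.parquet")
print(loader("s3://bucket/data.parquet"))
```

Keeping the lookup in one table means an unsupported combination fails early with a clear error instead of falling through to a default loader.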

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/core/data/dataset_builder.py
  • Lines: L18-160

Signature

class DatasetBuilder:
    def __init__(self, cfg: Namespace, executor_type: str = 'default'):
        """
        Args:
            cfg: Parsed config namespace with dataset_path or dataset dict.
            executor_type: 'default' for HuggingFace, 'ray' for distributed.
        """

    def load_dataset(self, **kwargs) -> DJDataset:
        """
        Load dataset using the selected strategy.

        Args:
            **kwargs: Passed to the loading strategy (e.g. num_proc).

        Returns:
            NestedDataset or RayDataset depending on executor_type.
        """

Import

from data_juicer.core.data import DatasetBuilder

I/O Contract

Inputs

Name           Type       Required  Description
cfg            Namespace  Yes       Config with dataset_path or dataset dict
executor_type  str        No        'default' or 'ray' (default: 'default')
**kwargs       dict       No        Extra args passed to loading strategy (e.g. num_proc)

Outputs

Name     Type                         Description
dataset  NestedDataset or RayDataset  Loaded dataset ready for operator processing
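The two accepted shapes of cfg (a single dataset_path, or a dataset dict) can be illustrated with a small normalizing helper. This is a sketch under assumptions: argparse.Namespace stands in for the parsed config object, list_sources is a hypothetical helper, and the key names inside the dataset dict (configs, path, weight) are assumed here and should be checked against the Data-Juicer configuration documentation.

```python
from argparse import Namespace
from typing import List, Tuple


def list_sources(cfg: Namespace) -> List[Tuple[str, float]]:
    """Normalize either config shape into (path, weight) pairs.

    Hypothetical helper; the 'configs', 'path', and 'weight' keys are assumed.
    """
    dataset = getattr(cfg, "dataset", None)
    if dataset:
        return [
            (entry["path"], float(entry.get("weight", 1.0)))
            for entry in dataset["configs"]
        ]
    dataset_path = getattr(cfg, "dataset_path", None)
    if dataset_path:
        return [(dataset_path, 1.0)]
    raise ValueError("cfg must define either dataset_path or dataset")


# Shape 1: a single path.
simple_cfg = Namespace(dataset_path="data/train.jsonl")

# Shape 2: a dataset dict with several weighted sources (assumed layout).
multi_cfg = Namespace(
    dataset={
        "configs": [
            {"type": "local", "path": "data/web.jsonl", "weight": 0.5},
            {"type": "local", "path": "data/code.parquet"},
        ]
    }
)

print(list_sources(simple_cfg))
print(list_sources(multi_cfg))
```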

Usage Examples

Basic Dataset Loading

from data_juicer.config import init_configs
from data_juicer.core.data import DatasetBuilder

cfg = init_configs(args=['--config', 'pipeline.yaml'])
builder = DatasetBuilder(cfg, executor_type='default')
dataset = builder.load_dataset()

print(len(dataset))  # Number of samples
print(dataset[0])    # First sample dict

Ray Distributed Loading

from data_juicer.config import init_configs
from data_juicer.core.data import DatasetBuilder

cfg = init_configs(args=['--config', 'ray_pipeline.yaml'])
builder = DatasetBuilder(cfg, executor_type='ray')
dataset = builder.load_dataset()
# Returns RayDataset wrapping ray.data.Dataset

Related Pages

Implements Principle

Requires Environment
