Implementation:Datajuicer Data juicer DatasetBuilder Load Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool, provided by the Data-Juicer framework, for loading datasets from multiple source types into a unified DJDataset.
Description
The DatasetBuilder class selects a loading strategy based on the executor type and dataset path, then loads data into either a NestedDataset (HuggingFace Dataset wrapper) or RayDataset (Ray Dataset wrapper). It supports local files (JSONL, Parquet, CSV, JSON), HuggingFace Hub paths, and S3 remote sources. Multiple datasets can be concatenated with configurable weights.
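The weighted concatenation mentioned above can be illustrated with a small, library-free sketch. It assumes that a weight controls each source's share of a fixed sample budget; this is one plausible reading, not a statement of how DatasetBuilder applies weights internally. The function name mix_weighted and the toy sources are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Dict[str, str]


def mix_weighted(
    sources: Sequence[Tuple[List[Sample], float]],
    total: int,
    seed: int = 0,
) -> List[Sample]:
    """Draw about `total` samples, split across sources in proportion to weight.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    weight_sum = sum(weight for _, weight in sources)
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")

    rng = random.Random(seed)
    mixed: List[Sample] = []
    for rows, weight in sources:
        # Each source's quota is proportional to its weight, capped at its size.
        quota = min(len(rows), round(total * weight / weight_sum))
        # Sample without replacement, then append to the combined list.
        mixed.extend(rng.sample(rows, quota))
    return mixed


# Two toy sources mixed 3:1 under a budget of 40 samples.
web = [{"text": f"web-{i}"} for i in range(100)]
code = [{"text": f"code-{i}"} for i in range(100)]
mixed = mix_weighted([(web, 3.0), (code, 1.0)], total=40)
```

With these inputs the budget splits into 30 samples from the first source and 10 from the second. Capping each quota at the source size means a small source can make the result shorter than the requested total.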
Usage
Use this class after calling init_configs to load the dataset specified in the configuration. The builder selects the loading strategy automatically from executor_type ('default' or 'ray').
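To make the idea of strategy selection concrete, here is a minimal sketch of dispatching on executor type and dataset path. It is not the actual Data-Juicer implementation: the function pick_strategy, the suffix list, and the returned labels are all assumptions made for illustration, and the sketch does not claim that every executor and source combination is supported by the real builder.

```python
from urllib.parse import urlparse

# Assumption: local files are recognized by these suffixes.
LOCAL_SUFFIXES = (".jsonl", ".json", ".parquet", ".csv")


def pick_strategy(executor_type: str, path: str) -> str:
    """Return a hypothetical strategy label for an executor type and path."""
    if executor_type not in ("default", "ray"):
        raise ValueError(f"unknown executor_type: {executor_type!r}")

    if urlparse(path).scheme == "s3":
        # Remote object storage, e.g. s3://bucket/key.parquet
        source = "s3"
    elif path.lower().endswith(LOCAL_SUFFIXES):
        # A file on local disk in one of the supported formats
        source = "local"
    else:
        # Assumption: anything else is treated as a HuggingFace Hub path
        source = "huggingface"
    return f"{executor_type}:{source}"
```

For example, pick_strategy('ray', 's3://bucket/part.parquet') yields the label 'ray:s3'. Keeping the decision in one pure function makes it easy to test the routing without loading any data.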
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/core/data/dataset_builder.py
- Lines: L18-160
Signature
class DatasetBuilder:
    def __init__(self, cfg: Namespace, executor_type: str = 'default'):
        """
        Args:
            cfg: Parsed config namespace with dataset_path or dataset dict.
            executor_type: 'default' for HuggingFace, 'ray' for distributed.
        """

    def load_dataset(self, **kwargs) -> DJDataset:
        """
        Load dataset using the selected strategy.

        Args:
            **kwargs: Passed to the loading strategy (e.g. num_proc).

        Returns:
            NestedDataset or RayDataset depending on executor_type.
        """
Import
from data_juicer.core.data import DatasetBuilder
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | Namespace | Yes | Config with dataset_path or dataset dict |
| executor_type | str | No | 'default' or 'ray' (default: 'default') |
| **kwargs | dict | No | Extra keyword arguments to load_dataset, forwarded to the loading strategy (e.g. num_proc) |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | NestedDataset or RayDataset | Loaded dataset ready for operator processing |
Usage Examples
Basic Dataset Loading
from data_juicer.config import init_configs
from data_juicer.core.data import DatasetBuilder
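# Parse the pipeline config; it must define dataset_path or a dataset dict
cfg = init_configs(args=['--config', 'pipeline.yaml'])
# 'default' selects the HuggingFace-backed loading path
builder = DatasetBuilder(cfg, executor_type='default')
dataset = builder.load_dataset()
print(len(dataset))  # Number of samples
print(dataset[0])    # First sample dict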
Ray Distributed Loading
from data_juicer.config import init_configs
from data_juicer.core.data import DatasetBuilder
cfg = init_configs(args=['--config', 'ray_pipeline.yaml'])
# 'ray' selects the distributed loading path
builder = DatasetBuilder(cfg, executor_type='ray')
dataset = builder.load_dataset()
# Returns a RayDataset wrapping ray.data.Dataset