Principle:Datajuicer Data juicer Dataset Loading

Knowledge Sources	Data-Juicer, HuggingFace Datasets
Domains	Data_Engineering, ETL
Last Updated	2026-02-14 17:00 GMT

Overview

A strategy-based data ingestion pattern that loads datasets from heterogeneous sources into a unified in-memory representation for processing.

Description

Dataset Loading hides the complexity of reading data from multiple source types (local files, remote URLs, HuggingFace Hub, S3 storage) behind a single unified interface. It uses the Strategy pattern to select the appropriate loading mechanism based on the source path format, then wraps the result in a framework-specific dataset abstraction that supports nested field access, multimodal data, and lazy evaluation. As a result, downstream code can work with diverse data formats (JSONL, Parquet, CSV, JSON) and sources without source-specific handling.
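
To make the selection step concrete, the following is a minimal sketch of classifying a source by its path format. It is not Data-Juicer code: the function names, the returned labels, and the assumption that Hub sources carry an hf:// prefix are all hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats named in the description above.
KNOWN_FORMATS = {"jsonl", "json", "csv", "parquet"}


def select_source_type(path: str) -> str:
    """Classify a source path so a matching loader can be chosen.

    The labels returned here are hypothetical, not Data-Juicer identifiers.
    """
    scheme = urlparse(path).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "remote_url"
    if scheme == "hf":  # assumed prefix for Hub sources
        return "huggingface_hub"
    # Anything else (no scheme, or e.g. a drive letter) is treated as local.
    return "local_file"


def detect_format(path: str) -> str:
    """Infer the data format from the file suffix, ignoring any query string."""
    suffix = PurePosixPath(urlparse(path).path).suffix.lower().lstrip(".")
    return suffix if suffix in KNOWN_FORMATS else "unknown"
```

Keeping the classification separate from the loaders means a new source type only needs one extra branch here and one new loader, which is the main benefit the Strategy pattern is meant to provide.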

Usage

Use this principle as the second step in any Data-Juicer pipeline, immediately after Configuration Initialization. It is required whenever raw data must be loaded into memory for operator-based processing or analysis.

Theoretical Basis

The loading process follows the Strategy pattern:

# Abstract algorithm (NOT the real implementation)
# 1. Determine the source type of each path/config entry
# 2. Load raw data using the selected strategy
raw_data_list = []
for source_path in source_paths:
    strategy = select_strategy(source_path, executor_type)
    raw_data_list.append(strategy.load(source_path, **kwargs))

# 3. Handle multi-source concatenation
if len(raw_data_list) > 1:
    raw_data = concatenate(raw_data_list)
else:
    raw_data = raw_data_list[0]

# 4. Wrap in the framework dataset abstraction
dataset = wrap_dataset(raw_data, executor_type)
# Returns NestedDataset (default) or RayDataset (distributed)
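
The four steps can also be sketched as a small runnable toy that uses only the Python standard library. Everything here is an assumption made for illustration: LOADERS, ToyDataset, and load_dataset are hypothetical names, only local JSONL and CSV files are handled, and the wrapper stands in for the framework abstraction rather than reproducing NestedDataset or RayDataset.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List

Record = Dict[str, object]


def load_jsonl(path: Path) -> List[Record]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_csv(path: Path) -> List[Record]:
    """Read rows as dictionaries keyed by the header line."""
    with path.open(newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f)]


# Strategy registry: file suffix -> loader function (hypothetical).
LOADERS: Dict[str, Callable[[Path], List[Record]]] = {
    ".jsonl": load_jsonl,
    ".csv": load_csv,
}


class ToyDataset:
    """Stand-in wrapper that supports dotted access to nested fields."""

    def __init__(self, records: List[Record]) -> None:
        self.records = records

    def __len__(self) -> int:
        return len(self.records)

    def column(self, dotted_key: str) -> List[object]:
        """Return one value per record; missing fields yield None."""
        values = []
        for record in self.records:
            value: object = record
            for part in dotted_key.split("."):
                value = value.get(part) if isinstance(value, dict) else None
            values.append(value)
        return values


def load_dataset(paths: List[str]) -> ToyDataset:
    """Select a loader per path, load, concatenate, and wrap."""
    records: List[Record] = []
    for raw_path in paths:
        path = Path(raw_path)
        loader = LOADERS.get(path.suffix.lower())
        if loader is None:
            raise ValueError(f"no loader registered for {path.suffix!r}")
        records.extend(loader(path))  # concatenation across sources
    return ToyDataset(records)
```

Calling load_dataset(["a.jsonl", "b.csv"]) would return one ToyDataset holding the records of both files in order, and column("meta.lang") would read a nested field from each record. Unlike the abstraction described above, this toy loads everything eagerly and does not model lazy evaluation or distributed execution.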

Related Pages

Implemented By

Implementation:Datajuicer_Data_juicer_DatasetBuilder_Load_Dataset
