
Principle:Datajuicer Data juicer Dataset Loading

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ETL
Last Updated 2026-02-14 17:00 GMT

Overview

A strategy-based data ingestion pattern that loads datasets from heterogeneous sources into a unified in-memory representation for processing.

Description

Dataset Loading hides the complexity of reading data from multiple source types (local files, remote URLs, Hugging Face Hub, S3 storage) behind a single unified interface. It uses the Strategy pattern to select the appropriate loading mechanism based on the source path format, then wraps the result in a framework-specific dataset abstraction that supports nested field access, multimodal data, and lazy evaluation. As a result, downstream operators do not need to know which format (JSONL, Parquet, CSV, JSON) or which source the data came from.
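To illustrate what "select the loading mechanism based on the source path format" can look like, here is a minimal sketch using only the Python standard library. The function name `classify_source`, the returned labels, and the heuristics (for example, treating a bare `org/name` string as a Hub identifier) are assumptions made for illustration; they are not Data-Juicer's actual rules or API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of file extensions treated as local data files.
KNOWN_SUFFIXES = {".jsonl", ".json", ".csv", ".parquet"}


def classify_source(path: str) -> str:
    """Return a hypothetical source-type label for a POSIX-style path or URL."""
    scheme = urlparse(path).scheme
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "remote_url"
    if scheme:
        # Any other scheme (ftp://, gs://, ...) is outside this sketch.
        raise ValueError(f"unsupported scheme: {scheme!r}")

    # No scheme: either a local path or a Hub-style "org/name" identifier.
    has_known_suffix = PurePosixPath(path).suffix.lower() in KNOWN_SUFFIXES
    if has_known_suffix or path.startswith(("/", "./", "../")):
        return "local_file"
    if path.count("/") == 1:
        # Assumption: exactly one slash and no file suffix means a Hub id.
        return "hub"
    return "local_file"
```

Note that a relative directory such as `data/raw` is indistinguishable from a Hub identifier under this heuristic; a real loader would likely also check whether the path exists on disk or rely on an explicit type in the configuration.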

Usage

Use this principle as the second step in any Data-Juicer pipeline, immediately after Configuration Initialization. It is required whenever raw data must be loaded into memory for operator-based processing or analysis.

Theoretical Basis

The loading process follows the Strategy pattern:

# Abstract algorithm (NOT the real implementation)
# 1. Determine the source type of each path/config entry
# 2. Load raw data with the selected strategy
raw_data_list = []
for source_path in source_paths:
    strategy = select_strategy(source_path, executor_type)
    raw_data_list.append(strategy.load(source_path, **kwargs))

# 3. Handle multi-source concatenation
if len(raw_data_list) > 1:
    raw_data = concatenate(raw_data_list)
else:
    raw_data = raw_data_list[0]

# 4. Wrap in the framework dataset abstraction
dataset = wrap_dataset(raw_data, executor_type)
# Returns NestedDataset (default) or RayDataset (distributed)
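The abstract steps above can be made concrete in a small, self-contained sketch. It uses only the Python standard library and covers just two local formats (JSONL and CSV). All names here (`STRATEGIES`, `ToyDataset`, `load_dataset`) are hypothetical stand-ins, not Data-Juicer's classes; in particular, `ToyDataset` is eager and in-memory, whereas the text describes a wrapper that also supports lazy evaluation and a distributed variant.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Callable

Row = dict


def load_jsonl(path: Path) -> list[Row]:
    """Strategy for JSON Lines: one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def load_csv(path: Path) -> list[Row]:
    """Strategy for CSV: the header row supplies the field names."""
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


# Step 1: the strategy is chosen from the file extension (an assumption).
STRATEGIES: dict[str, Callable[[Path], list[Row]]] = {
    ".jsonl": load_jsonl,
    ".csv": load_csv,
}


def select_strategy(path: Path) -> Callable[[Path], list[Row]]:
    try:
        return STRATEGIES[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no loading strategy for {path.suffix!r}") from None


class ToyDataset:
    """Hypothetical wrapper offering dotted-key access to nested fields."""

    def __init__(self, rows: list[Row]):
        self.rows = rows

    def __len__(self) -> int:
        return len(self.rows)

    def get(self, index: int, dotted_key: str):
        value = self.rows[index]
        for part in dotted_key.split("."):
            value = value[part]
        return value


def load_dataset(paths) -> ToyDataset:
    rows: list[Row] = []
    for path in map(Path, paths):
        # Steps 2 and 3: load each source, then concatenate the rows.
        rows.extend(select_strategy(path)(path))
    # Step 4: wrap the combined rows in the dataset abstraction.
    return ToyDataset(rows)


# Usage: two sources in different formats become one dataset.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "a.jsonl"
    csv_path = Path(tmp) / "b.csv"
    jsonl_path.write_text(
        '{"text": "hello", "meta": {"lang": "en"}}\n', encoding="utf-8"
    )
    csv_path.write_text("text\nworld\n", encoding="utf-8")
    dataset = load_dataset([jsonl_path, csv_path])
```

Adding a new format in this design means registering one more function in `STRATEGIES`; the concatenation and wrapping code does not change, which is the main benefit the Strategy pattern is meant to provide.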

Related Pages

Implemented By
