Principle:Datajuicer Data juicer Dataset Loading
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A strategy-based data ingestion pattern that loads datasets from heterogeneous sources into a unified in-memory representation for processing.
Description
Dataset Loading abstracts the complexity of reading data from multiple source types (local files, remote URLs, HuggingFace Hub, S3 storage) behind a single unified interface. It employs the Strategy pattern to select the appropriate loading mechanism based on the source path format, then wraps the result in a framework-specific dataset abstraction that supports nested field access, multimodal data, and lazy evaluation. As a result, downstream code can consume diverse data formats (JSONL, Parquet, CSV, JSON) and sources without any source-specific handling.
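To illustrate what "selecting a loading mechanism based on the source path format" might look like, here is a minimal sketch. The function name `classify_source`, the category labels, and the heuristics are all assumptions made for illustration; they are not taken from the Data-Juicer codebase, and a real implementation would likely also consult the configuration and the filesystem.

```python
import os
from urllib.parse import urlparse

# Hypothetical set of file extensions treated as local data files.
KNOWN_EXTENSIONS = {".jsonl", ".json", ".parquet", ".csv"}


def classify_source(path: str) -> str:
    """Guess the source type of a dataset path (illustrative heuristic only)."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "remote"
    # A recognised data-file extension is treated as a local file.
    if os.path.splitext(path)[1].lower() in KNOWN_EXTENSIONS:
        return "local"
    # Anything else is assumed to be a hub-style identifier such as "org/name".
    return "hub"
```

Each label returned here would then map to one loading strategy. Note that this simplified heuristic would misclassify a local directory without a file extension, which is one reason a production selector needs more context than the path string alone.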
Usage
Use this principle as the second step in any Data-Juicer pipeline, immediately after Configuration Initialization. It is required whenever raw data must be loaded into memory for operator-based processing or analysis.
Theoretical Basis
The loading process follows the Strategy pattern:
# Abstract algorithm (NOT real implementation)
# 1. Determine source type from each path/config entry
# 2. Load raw data using the selected strategy
raw_data_list = []
for source_path in source_paths:
    strategy = select_strategy(source_path, executor_type)
    raw_data_list.append(strategy.load(source_path, **kwargs))
# 3. Handle multi-source concatenation
if len(raw_data_list) > 1:
    raw_data = concatenate(raw_data_list)
else:
    raw_data = raw_data_list[0]
# 4. Wrap in framework dataset abstraction
dataset = wrap_dataset(raw_data, executor_type)
# Returns NestedDataset (default) or RayDataset (distributed)
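The abstract steps above can be made concrete with a toy, standard-library-only sketch. Everything in it is hypothetical: `ToyDataset` is a stand-in for the framework wrapper (not `NestedDataset` or `RayDataset`), only JSONL and CSV strategies are registered, and loading is eager rather than lazy. It is meant to show the shape of the pattern, not Data-Juicer's actual implementation.

```python
import csv
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


def load_jsonl(path: str) -> List[Record]:
    """Strategy: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_csv(path: str) -> List[Record]:
    """Strategy: header row followed by data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f)]


# Hypothetical registry mapping a file extension to a loading strategy.
STRATEGIES: Dict[str, Callable[[str], List[Record]]] = {
    ".jsonl": load_jsonl,
    ".csv": load_csv,
}


def select_strategy(path: str) -> Callable[[str], List[Record]]:
    ext = os.path.splitext(path)[1].lower()
    if ext not in STRATEGIES:
        raise ValueError(f"no loading strategy for {path!r}")
    return STRATEGIES[ext]


class ToyDataset:
    """Illustrative wrapper offering dotted-key nested field access."""

    def __init__(self, records: List[Record]) -> None:
        self.records = records

    def __len__(self) -> int:
        return len(self.records)

    def column(self, dotted_key: str) -> List[Optional[object]]:
        values = []
        for record in self.records:
            value: object = record
            for part in dotted_key.split("."):
                # Missing or non-dict intermediate values resolve to None.
                value = value.get(part) if isinstance(value, dict) else None
            values.append(value)
        return values


def load_dataset(paths: List[str]) -> ToyDataset:
    records: List[Record] = []
    for path in paths:
        # Select a strategy per source, then concatenate the results.
        records.extend(select_strategy(path)(path))
    return ToyDataset(records)


# Small demonstration using two temporary files in different formats.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "a.jsonl")
    csv_path = os.path.join(tmp, "b.csv")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "hello", "meta": {"lang": "en"}}) + "\n")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        f.write("text\nworld\n")
    demo = load_dataset([jsonl_path, csv_path])
```

After this runs, `demo` holds two records drawn from both files, and `demo.column("meta.lang")` returns a value only for the record that actually has that nested field. Keeping the strategies in a registry means a new format can be supported by adding one entry, without touching `load_dataset`.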