Principle: Data-Juicer Data Processing Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Execution |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A sequential operator execution pattern that applies a chain of data processing operators to a dataset with support for checkpointing, tracing, and monitoring.
Description
Data Processing Execution is the core runtime loop that takes a loaded dataset and a list of instantiated operators, then applies each operator sequentially to the dataset. Each operator transforms the dataset through its process method (which internally calls dataset.map() for per-sample or batched transformations). The execution supports resumable checkpointing (skip already-completed operators), sample-level tracing (track which samples are removed and why), resource monitoring, and intermediate exports. This is the step where data actually gets cleaned, filtered, and transformed.
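To make the per-operator transform concrete, here is a minimal sketch of how an operator's process method can transform or drop samples. The class names, the plain-list dataset, and the `apply_op` helper are all hypothetical illustrations, not Data-Juicer's actual API:

```python
# Hypothetical operators (not Data-Juicer's real classes): a mapper that
# rewrites each sample, and a filter that drops samples by returning None.

class LowercaseOp:
    """Mapper: transforms each sample's text field."""
    def process(self, sample):
        sample["text"] = sample["text"].lower()
        return sample

class MinLengthFilterOp:
    """Filter: keeps only samples whose text is long enough."""
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, sample):
        # Returning None marks the sample for removal.
        return sample if len(sample["text"]) >= self.min_len else None

def apply_op(dataset, op):
    """Apply one operator to every sample; drop samples mapped to None."""
    out = []
    for sample in dataset:
        result = op.process(dict(sample))
        if result is not None:
            out.append(result)
    return out

dataset = [{"text": "Hello World"}, {"text": "Hi"}]
dataset = apply_op(dataset, LowercaseOp())
dataset = apply_op(dataset, MinLengthFilterOp(min_len=5))
print(dataset)  # [{'text': 'hello world'}]
```

In the real system the per-sample loop is delegated to `dataset.map()`, which can batch and parallelize the same transformation; the sketch above only shows the operator contract.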
Usage
Use this principle as the main execution step after configuration, loading, instantiation, and optional fusion. This is where the actual data processing happens. The execution is orchestrated by the executor (DefaultExecutor) which sets up checkpointing, tracing, and monitoring before calling the dataset's process method.
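The executor's role can be sketched as follows. `SimpleExecutor`, its `run` method, and the string-based samples are invented for illustration; only the orchestration order (set up checkpointing/tracing, then drive the operator chain) is taken from the text above:

```python
# Hypothetical executor sketch (not DefaultExecutor's real interface):
# monitoring state is prepared up front, then each operator is applied
# in sequence and its effect recorded.

class SimpleExecutor:
    def __init__(self, use_checkpoint=True):
        # Set up bookkeeping *before* any processing starts.
        self.checkpoint_log = [] if use_checkpoint else None

    def run(self, dataset, operators):
        for op in operators:
            # Apply the operator; None means "remove this sample".
            dataset = [s for s in map(op, dataset) if s is not None]
            if self.checkpoint_log is not None:
                # Record operator name and surviving sample count.
                self.checkpoint_log.append((op.__name__, len(dataset)))
        return dataset

def drop_empty(sample):
    return sample if sample else None

ex = SimpleExecutor()
out = ex.run(["  A  ", ""], [str.strip, drop_empty, str.lower])
print(out)                # ['a']
print(ex.checkpoint_log)  # [('strip', 2), ('drop_empty', 1), ('lower', 1)]
```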
Theoretical Basis
# Abstract algorithm (NOT real implementation)
for op in operators:
    # Skip operators already completed in a previous run (checkpoint)
    if checkpointer.should_skip(op):
        continue
    # Apply operator to dataset
    # (internally: dataset.map(op.process, ...))
    dataset = dataset.apply(op)
    # Save checkpoint after each operator
    checkpointer.save(dataset, op)
    # Export intermediate results if configured
    if exporter:
        exporter.export(dataset)
    # Trace removed samples
    if tracer:
        tracer.record(op, removed_samples)
return dataset
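The abstract loop above can be turned into a small runnable sketch. The `Checkpointer`, `Tracer`, and operator classes below are invented stand-ins under the assumptions of this principle, not Data-Juicer's real implementations:

```python
# Runnable sketch of the sequential execution loop with resumable
# checkpointing and sample-level tracing (all classes are hypothetical).

class Checkpointer:
    def __init__(self):
        self.done = set()   # names of operators already completed
        self.saved = {}     # op name -> dataset snapshot after that op

    def should_skip(self, op):
        return op.__class__.__name__ in self.done

    def save(self, dataset, op):
        name = op.__class__.__name__
        self.done.add(name)
        self.saved[name] = list(dataset)

class Tracer:
    def __init__(self):
        self.removed = {}   # op name -> samples that op removed

    def record(self, op, removed_samples):
        self.removed[op.__class__.__name__] = removed_samples

class DropShortOp:
    """Filter: keep samples with at least min_len characters of text."""
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, sample):
        return sample if len(sample["text"]) >= self.min_len else None

def run_pipeline(dataset, operators, checkpointer, tracer=None):
    for op in operators:
        if checkpointer.should_skip(op):   # resume: skip completed ops
            continue
        before = list(dataset)
        dataset = [s for s in (op.process(s) for s in before) if s is not None]
        checkpointer.save(dataset, op)     # checkpoint after each operator
        if tracer:                         # trace which samples were removed
            kept = {id(s) for s in dataset}
            tracer.record(op, [s for s in before if id(s) not in kept])
    return dataset

data = [{"text": "keep this sample"}, {"text": "no"}]
ckpt, tr = Checkpointer(), Tracer()
result = run_pipeline(data, [DropShortOp(min_len=5)], ckpt, tr)
print(result)      # [{'text': 'keep this sample'}]
print(tr.removed)  # {'DropShortOp': [{'text': 'no'}]}
```

Re-running `run_pipeline` with the same `Checkpointer` skips the completed operator, which is the resumability property the principle describes.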