Principle:Datajuicer Data juicer Data Processing Execution

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Pipeline_Execution
Last Updated 2026-02-14 17:00 GMT

Overview

A sequential operator execution pattern that applies a chain of data processing operators to a dataset with support for checkpointing, tracing, and monitoring.

Description

Data Processing Execution is the core runtime loop: it takes a loaded dataset and a list of instantiated operators, then applies the operators to the dataset one at a time, in order. Each operator transforms the dataset through its process method, which internally calls dataset.map() for per-sample or batched transformations. Execution supports resumable checkpointing (operators that already completed are skipped), sample-level tracing (recording which samples are removed and why), resource monitoring, and intermediate exports. This is the step where data is actually cleaned, filtered, and transformed.
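The sequential application and sample-level tracing described here can be illustrated with a small, self-contained sketch. It does not use Data-Juicer itself: `Mapper`, `Filter`, and `run_chain` are hypothetical stand-ins, and a plain list of dicts takes the place of the dataset. It only shows the general shape of applying operators in order while recording which samples each filter removed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, str]


@dataclass
class Mapper:
    """Hypothetical per-sample transformation (stand-in for a mapper operator)."""
    name: str
    fn: Callable[[Sample], Sample]

    def process(self, samples: List[Sample]) -> List[Sample]:
        # Copy each sample so the input list is left untouched.
        return [self.fn(dict(s)) for s in samples]


@dataclass
class Filter:
    """Hypothetical per-sample predicate (stand-in for a filter operator)."""
    name: str
    keep: Callable[[Sample], bool]

    def process(self, samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if self.keep(s)]


def run_chain(samples, operators):
    """Apply operators in order and record what each filter removed."""
    trace = {}
    for op in operators:
        before = samples
        samples = op.process(samples)
        if isinstance(op, Filter):
            trace[op.name] = [s for s in before if not op.keep(s)]
    return samples, trace


data = [{"text": "  Hello world  "}, {"text": " hi "}, {"text": "A longer sentence."}]
ops = [
    Mapper("strip_whitespace", lambda s: {**s, "text": s["text"].strip()}),
    Filter("min_length_5", lambda s: len(s["text"]) >= 5),
]
cleaned, trace = run_chain(data, ops)
print(cleaned)
print(trace)
```

Because the filter runs after the mapper, it sees the stripped text, so operator order affects which samples survive.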

Usage

Use this principle as the main execution step, after configuration, loading, instantiation, and optional fusion; this is where the actual data processing happens. The execution is orchestrated by the executor (DefaultExecutor), which sets up checkpointing, tracing, and monitoring before calling the dataset's process method.
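The division of labor between executor and dataset might look roughly like the following sketch. `ToyExecutor` and `ToyDataset` are hypothetical names invented for illustration and are not the DefaultExecutor API; the monitor is reduced to a list of per-operator timings and row counts. The point is only that setup happens in the executor, while the operator loop lives in the dataset's process method.

```python
import time
from typing import Callable, Dict, List

Sample = Dict[str, str]
Operator = Callable[[List[Sample]], List[Sample]]


class ToyDataset:
    """Hypothetical dataset wrapper whose process() runs the operator chain."""

    def __init__(self, samples: List[Sample]):
        self.samples = samples

    def process(self, operators: Dict[str, Operator], monitor=None) -> "ToyDataset":
        current = self.samples
        for name, op in operators.items():
            start = time.perf_counter()
            result = op(current)
            if monitor is not None:
                # Record simple per-operator statistics.
                monitor.append({
                    "op": name,
                    "seconds": time.perf_counter() - start,
                    "rows_in": len(current),
                    "rows_out": len(result),
                })
            current = result
        return ToyDataset(current)


class ToyExecutor:
    """Hypothetical executor: prepares monitoring, then delegates to the dataset."""

    def __init__(self, operators: Dict[str, Operator], enable_monitor: bool = True):
        self.operators = operators
        self.monitor_log = [] if enable_monitor else None

    def run(self, dataset: ToyDataset) -> ToyDataset:
        # All setup is done before any operator runs; the loop itself
        # belongs to the dataset.
        return dataset.process(self.operators, monitor=self.monitor_log)


executor = ToyExecutor({
    "lowercase": lambda rows: [{**r, "text": r["text"].lower()} for r in rows],
    "drop_empty": lambda rows: [r for r in rows if r["text"]],
})
result = executor.run(ToyDataset([{"text": "ABC"}, {"text": ""}]))
print(result.samples)
for entry in executor.monitor_log:
    print(entry["op"], entry["rows_in"], "->", entry["rows_out"])
```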

Theoretical Basis

# Abstract algorithm (NOT the real implementation)
def run_operators(dataset, operators, checkpointer, exporter=None, tracer=None):
    for op in operators:
        # Skip operators already completed in a previous run (checkpoint)
        if checkpointer.should_skip(op):
            continue

        # Apply the operator to the dataset
        # Internally: dataset.map(op.process, ...)
        previous = dataset
        dataset = dataset.apply(op)

        # Save a checkpoint after each operator
        checkpointer.save(dataset, op)

        # Export intermediate results if configured
        if exporter:
            exporter.export(dataset)

        # Trace the samples this operator removed
        if tracer:
            removed_samples = [s for s in previous if s not in dataset]
            tracer.record(op, removed_samples)

    return dataset
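The checkpoint-and-skip step of the abstract algorithm can be made concrete with a runnable toy. Everything here is an assumption made for illustration: `FileCheckpointer`, `run_with_checkpoints`, and the JSON file layout are hypothetical, and the dataset is a plain list of integers. The sketch simulates a crash in the second operator, then resumes and skips the operator that already finished.

```python
import json
import tempfile
from pathlib import Path


class FileCheckpointer:
    """Hypothetical checkpointer: stores finished operator names and the dataset as JSON."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self, initial_dataset):
        # Resume from the saved state if one exists, else start fresh.
        if self.path.exists():
            state = json.loads(self.path.read_text())
            return state["dataset"], state["done"]
        return initial_dataset, []

    def save(self, dataset, done):
        self.path.write_text(json.dumps({"dataset": dataset, "done": done}))


def run_with_checkpoints(initial_dataset, operators, checkpointer, executed):
    dataset, done = checkpointer.load(initial_dataset)
    for name, op in operators:
        if name in done:  # completed in an earlier run, so skip it
            continue
        dataset = op(dataset)
        executed.append(name)
        done = done + [name]
        checkpointer.save(dataset, done)
    return dataset


def failing_op(rows):
    raise RuntimeError("simulated crash")


double = ("double", lambda rows: [x * 2 for x in rows])
increment = ("increment", lambda rows: [x + 1 for x in rows])

with tempfile.TemporaryDirectory() as tmp:
    ckpt = FileCheckpointer(Path(tmp) / "checkpoint.json")
    first_run, second_run = [], []
    try:
        run_with_checkpoints([1, 2, 3], [double, ("increment", failing_op)], ckpt, first_run)
    except RuntimeError:
        pass  # the first run stops after "double" has been checkpointed
    final = run_with_checkpoints([1, 2, 3], [double, increment], ckpt, second_run)

print(first_run, second_run, final)
```

Saving after every operator trades extra I/O for the ability to restart from the last completed step rather than from the beginning.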

Related Pages

Implemented By

Uses Heuristic