Principle: Data-Juicer Data Processing Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Execution |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A sequential operator execution pattern that applies a chain of data processing operators to a dataset with support for checkpointing, tracing, and monitoring.
Description
Data Processing Execution is the core runtime loop that takes a loaded dataset and a list of instantiated operators, then applies each operator sequentially to the dataset. Each operator transforms the dataset through its process method (which internally calls dataset.map() for per-sample or batched transformations). The execution supports resumable checkpointing (skip already-completed operators), sample-level tracing (track which samples are removed and why), resource monitoring, and intermediate exports. This is the step where data actually gets cleaned, filtered, and transformed.
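To make the per-operator transform concrete, here is a minimal sketch of how an operator's process method can transform or drop samples. The class names, the plain-list dataset, and the `apply_op` helper are all hypothetical illustrations, not Data-Juicer's actual API:

```python
# Hypothetical operators (not Data-Juicer's real classes): a mapper that
# rewrites each sample, and a filter that drops samples by returning None.

class LowercaseOp:
    """Mapper: transforms each sample's text field."""
    def process(self, sample):
        sample["text"] = sample["text"].lower()
        return sample

class MinLengthFilterOp:
    """Filter: keeps only samples whose text is long enough."""
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, sample):
        # Returning None marks the sample for removal.
        return sample if len(sample["text"]) >= self.min_len else None

def apply_op(dataset, op):
    """Apply one operator to every sample; drop samples mapped to None."""
    out = []
    for sample in dataset:
        result = op.process(dict(sample))
        if result is not None:
            out.append(result)
    return out

dataset = [{"text": "Hello World"}, {"text": "Hi"}]
dataset = apply_op(dataset, LowercaseOp())
dataset = apply_op(dataset, MinLengthFilterOp(min_len=5))
print(dataset)  # [{'text': 'hello world'}]
```

In the real system the per-sample loop is delegated to `dataset.map()`, which can batch and parallelize the same transformation; the sketch above only shows the operator contract.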
Usage
Use this principle as the main execution step after configuration, loading, instantiation, and optional fusion. This is where the actual data processing happens. The execution is orchestrated by the executor (DefaultExecutor) which sets up checkpointing, tracing, and monitoring before calling the dataset's process method.
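The executor's role can be sketched as follows. `SimpleExecutor`, its `run` method, and the string-based samples are invented for illustration; only the orchestration order (set up checkpointing/tracing, then drive the operator chain) is taken from the text above:

```python
# Hypothetical executor sketch (not DefaultExecutor's real interface):
# monitoring state is prepared up front, then each operator is applied
# in sequence and its effect recorded.

class SimpleExecutor:
    def __init__(self, use_checkpoint=True):
        # Set up bookkeeping *before* any processing starts.
        self.checkpoint_log = [] if use_checkpoint else None

    def run(self, dataset, operators):
        for op in operators:
            # Apply the operator; None means "remove this sample".
            dataset = [s for s in map(op, dataset) if s is not None]
            if self.checkpoint_log is not None:
                # Record operator name and surviving sample count.
                self.checkpoint_log.append((op.__name__, len(dataset)))
        return dataset

def drop_empty(sample):
    return sample if sample else None

ex = SimpleExecutor()
out = ex.run(["  A  ", ""], [str.strip, drop_empty, str.lower])
print(out)                # ['a']
print(ex.checkpoint_log)  # [('strip', 2), ('drop_empty', 1), ('lower', 1)]
```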
Theoretical Basis
# Abstract algorithm (NOT real implementation)
for op in operators:
    # Skip operators already completed in a previous run (checkpoint)
    if checkpointer.should_skip(op):
        continue
    # Apply operator to dataset
    # (internally: dataset.map(op.process, ...))
    dataset = dataset.apply(op)
    # Save checkpoint after each operator
    checkpointer.save(dataset, op)
    # Export intermediate results if configured
    if exporter:
        exporter.export(dataset)
    # Trace removed samples
    if tracer:
        tracer.record(op, removed_samples)
return dataset
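The abstract loop above can be turned into a small runnable sketch. The `Checkpointer`, `Tracer`, and operator classes below are invented stand-ins under the assumptions of this principle, not Data-Juicer's real implementations:

```python
# Runnable sketch of the sequential execution loop with resumable
# checkpointing and sample-level tracing (all classes are hypothetical).

class Checkpointer:
    def __init__(self):
        self.done = set()   # names of operators already completed
        self.saved = {}     # op name -> dataset snapshot after that op

    def should_skip(self, op):
        return op.__class__.__name__ in self.done

    def save(self, dataset, op):
        name = op.__class__.__name__
        self.done.add(name)
        self.saved[name] = list(dataset)

class Tracer:
    def __init__(self):
        self.removed = {}   # op name -> samples that op removed

    def record(self, op, removed_samples):
        self.removed[op.__class__.__name__] = removed_samples

class DropShortOp:
    """Filter: keep samples with at least min_len characters of text."""
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, sample):
        return sample if len(sample["text"]) >= self.min_len else None

def run_pipeline(dataset, operators, checkpointer, tracer=None):
    for op in operators:
        if checkpointer.should_skip(op):   # resume: skip completed ops
            continue
        before = list(dataset)
        dataset = [s for s in (op.process(s) for s in before) if s is not None]
        checkpointer.save(dataset, op)     # checkpoint after each operator
        if tracer:                         # trace which samples were removed
            kept = {id(s) for s in dataset}
            tracer.record(op, [s for s in before if id(s) not in kept])
    return dataset

data = [{"text": "keep this sample"}, {"text": "no"}]
ckpt, tr = Checkpointer(), Tracer()
result = run_pipeline(data, [DropShortOp(min_len=5)], ckpt, tr)
print(result)      # [{'text': 'keep this sample'}]
print(tr.removed)  # {'DropShortOp': [{'text': 'no'}]}
```

Re-running `run_pipeline` with the same `Checkpointer` skips the completed operator, which is the resumability property the principle describes.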