Principle:Datajuicer Data juicer Operator Fusion
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Performance_Optimization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
An optimization technique that merges consecutive compatible operators into fused units to reduce redundant data passes and intermediate computations.
Description
Operator Fusion analyzes a sequence of data processing operators and identifies groups that can share intermediate results. For example, consecutive Filter operators that compute overlapping statistics can be fused into a single FusedFilter that computes all statistics in one pass and applies all filter predicates together. For a group of N fused filters, this reduces the number of dataset scans from N (one per filter) to 1 (one fused pass), which can noticeably speed up pipelines with many filters. Separately, adaptive workload balancing probes each operator's performance and uses the measurements to calculate a suitable batch size per operator.
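The batch-size step of adaptive workload balancing can be sketched as follows. This is only an illustration under stated assumptions: the helper names `probe_speed` and `batch_size_for`, the target time per batch, and the clamping bounds are all hypothetical and are not taken from the Data-Juicer code base. The idea shown is simply that a faster operator can be given a larger batch so that each batch takes roughly the same wall-clock time.

```python
import time


def probe_speed(op, samples):
    """Measure throughput of `op` in samples per second (hypothetical helper)."""
    start = time.perf_counter()
    for sample in samples:
        op(sample)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast operators.
    return len(samples) / max(elapsed, 1e-9)


def batch_size_for(speed, target_seconds=0.5, min_size=1, max_size=10_000):
    """Pick a batch size so one batch takes roughly `target_seconds`.

    Assumption: batch size scales linearly with probed speed and is
    clamped to [min_size, max_size].
    """
    return max(min_size, min(max_size, int(speed * target_seconds)))


# Usage: probe a toy operator on a small sample, then derive its batch size.
probe_samples = ["some text"] * 100
speed = probe_speed(len, probe_samples)
batch_size = batch_size_for(speed)
```

A slow operator (low probed speed) ends up with a small batch and a fast one with a large batch, which is one simple way to balance work across operators.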
Usage
Use this principle after operator instantiation and before data processing to optimize execution. It is particularly beneficial when the pipeline contains multiple consecutive filters that share intermediate variables (e.g., tokenized text, word counts).
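To illustrate why shared intermediate variables matter, here is a minimal sketch of two filters that both need the tokenized text. All class and function names (`MinWordsFilter`, `MaxAvgWordLengthFilter`, `FusedFilterSketch`, `get_words`) are hypothetical and written for this example; they are not the library's API. The fused unit tokenizes each sample once, stores the words in a per-sample context, and then applies every predicate.

```python
TOKENIZE_CALLS = 0  # counts how often tokenization actually runs


def get_words(text, context):
    """Tokenize `text` once and cache the result in the shared `context`."""
    global TOKENIZE_CALLS
    if "words" not in context:
        TOKENIZE_CALLS += 1
        context["words"] = text.split()
    return context["words"]


class MinWordsFilter:
    def __init__(self, min_words):
        self.min_words = min_words

    def compute_stats(self, sample, context):
        sample["stats"]["num_words"] = len(get_words(sample["text"], context))

    def keep(self, sample):
        return sample["stats"]["num_words"] >= self.min_words


class MaxAvgWordLengthFilter:
    def __init__(self, max_avg_len):
        self.max_avg_len = max_avg_len

    def compute_stats(self, sample, context):
        words = get_words(sample["text"], context)
        total = sum(len(w) for w in words)
        sample["stats"]["avg_word_len"] = total / len(words) if words else 0.0

    def keep(self, sample):
        return sample["stats"]["avg_word_len"] <= self.max_avg_len


class FusedFilterSketch:
    """Runs all member filters over the data in a single pass."""

    def __init__(self, filters):
        self.filters = filters

    def run(self, samples):
        kept = []
        for sample in samples:
            sample = {**sample, "stats": {}}
            context = {}  # shared intermediate variables for this sample
            for f in self.filters:
                f.compute_stats(sample, context)
            if all(f.keep(sample) for f in self.filters):
                kept.append(sample)
        return kept


samples = [
    {"text": "a quick brown fox jumps"},
    {"text": "hi"},
    {"text": "supercalifragilistic words everywhere here now"},
]
fused = FusedFilterSketch([MinWordsFilter(3), MaxAvgWordLengthFilter(6.0)])
kept = fused.run(samples)
```

With two filters and three samples, tokenization runs three times in the fused pass, where running the filters separately without a shared context would run it six times.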
Theoretical Basis
    # Abstract algorithm (NOT real implementation)
    def fuse_operators(operators, speed_probes=None):
        fused_ops = []
        current_group = []
        for op in operators:
            # An empty group accepts any operator; otherwise check fusibility
            if not current_group or can_fuse(current_group, op):
                current_group.append(op)
            else:
                # Close the current group: fuse it only if it has 2+ operators
                if len(current_group) > 1:
                    fused_ops.append(FusedFilter(current_group))
                else:
                    fused_ops.extend(current_group)
                current_group = [op]
        # Flush remaining group (same len > 1 check as in the loop)
        flush(current_group, fused_ops)
        # Optionally reorder by probed speed for efficiency
        if speed_probes is not None:
            fused_ops = reorder_by_speed(fused_ops, speed_probes)
        return fused_ops
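The fusion criterion (can_fuse in the algorithm) requires that every operator in a group is a Filter and that consecutive filters share intermediate computation variables. Operators that do not meet the criterion pass through unfused.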
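The grouping logic and the fusion criterion can be made concrete with a small runnable sketch. The `Op` record, its `kind` and `inter_vars` fields, and the tuple-based output are assumptions made for illustration, not the real operator classes. Here "share intermediate variables" is interpreted as a non-empty intersection between the new filter's variables and those of the last filter in the group; speed-based reordering is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for an instantiated operator."""
    name: str
    kind: str                              # "filter" or "mapper"
    inter_vars: frozenset = frozenset()    # intermediate variables it uses


def can_fuse(group, op):
    """Fusion criterion: all Filters, and `op` shares a variable with the last one."""
    if not group:
        return True  # any operator may start a new group
    all_filters = op.kind == "filter" and all(g.kind == "filter" for g in group)
    return all_filters and bool(group[-1].inter_vars & op.inter_vars)


def flush(group, out):
    """Emit a fused unit for groups of 2+ operators, otherwise pass through."""
    if len(group) > 1:
        out.append(("fused", tuple(g.name for g in group)))
    else:
        out.extend(("single", (g.name,)) for g in group)


def fuse_operators(operators):
    fused, group = [], []
    for op in operators:
        if can_fuse(group, op):
            group.append(op)
        else:
            flush(group, fused)
            group = [op]
    flush(group, fused)  # flush the remaining group
    return fused


pipeline = [
    Op("clean", "mapper"),
    Op("words_num", "filter", frozenset({"words"})),
    Op("word_rep", "filter", frozenset({"words"})),
    Op("text_len", "filter"),
    Op("dedup", "mapper"),
]
plan = fuse_operators(pipeline)
```

In this toy pipeline only `words_num` and `word_rep` end up fused, because they are adjacent Filters that both use the `words` variable; `text_len` is a Filter but shares nothing, so it stays on its own.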