Principle:Datajuicer Data juicer Operator Fusion

Knowledge Sources	Data-Juicer
Domains	Data_Engineering, Performance_Optimization
Last Updated	2026-02-14 17:00 GMT

Overview

An optimization technique that merges consecutive compatible operators into fused units to reduce redundant data passes and intermediate computations.

Description

Operator Fusion analyzes a sequence of data processing operators and identifies groups that can share intermediate results. For example, consecutive Filter operators that compute overlapping statistics can be fused into a single FusedFilter that computes all statistics in one pass and applies all filter predicates together. This reduces the number of dataset scans from N (one per filter) to 1 (one fused pass), and each shared intermediate is computed once instead of once per filter, so the saving grows with the number of fusible filters in the pipeline. Additionally, adaptive workload balancing probes each operator's performance and uses the measurements to calculate a suitable batch size per operator.
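The workload-balancing idea at the end of the description can be sketched as follows. This is a minimal illustration, not Data-Juicer's probing code: the function names, the `target_batch_seconds` parameter, and the rule "size each batch to take roughly a fixed wall-clock time" are all assumptions made for the example.

```python
import time


def probe_seconds_per_sample(op, probe_samples, timer=time.perf_counter):
    """Run `op` over a small probe set and return the mean time per sample.

    `op` is any callable taking one sample (hypothetical interface).
    `timer` is injectable so the probe can be tested deterministically.
    """
    if not probe_samples:
        raise ValueError("probe_samples must not be empty")
    start = timer()
    for sample in probe_samples:
        op(sample)
    elapsed = timer() - start
    return elapsed / len(probe_samples)


def batch_size_for(seconds_per_sample, target_batch_seconds=0.1,
                   min_size=1, max_size=10_000):
    """Assumed rule: pick a size so one batch takes about `target_batch_seconds`."""
    if seconds_per_sample <= 0:
        # Too fast to measure: fall back to the largest allowed batch.
        return max_size
    size = int(target_batch_seconds / seconds_per_sample)
    # Clamp so very slow or very fast operators still get a usable size.
    return max(min_size, min(max_size, size))
```

Under this rule a slow operator receives small batches and a fast one receives large batches, which keeps per-batch latency roughly even across operators.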

Usage

Use this principle after operator instantiation and before data processing to optimize execution. It is particularly beneficial when the pipeline contains multiple consecutive filters that share intermediate variables (e.g., tokenized text, word counts).
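To make the shared-intermediate case concrete, here is a small sketch in which two filters both need the tokenized words of a sample. The classes, the `context` dictionary, and the whitespace tokenizer are hypothetical stand-ins chosen for illustration; they are not Data-Juicer's API.

```python
class CountingTokenizer:
    """Whitespace tokenizer that records how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return text.split()


def get_words(text, context, tokenizer):
    # Reuse the tokenized words if an earlier filter already computed them.
    if "words" not in context:
        context["words"] = tokenizer(text)
    return context["words"]


class WordCountFilter:
    """Keeps samples whose word count lies in [min_words, max_words]."""

    def __init__(self, tokenizer, min_words, max_words):
        self.tokenizer = tokenizer
        self.min_words = min_words
        self.max_words = max_words

    def keep(self, text, context):
        words = get_words(text, context, self.tokenizer)
        return self.min_words <= len(words) <= self.max_words


class AvgWordLengthFilter:
    """Keeps samples whose mean word length is at most max_avg_len."""

    def __init__(self, tokenizer, max_avg_len):
        self.tokenizer = tokenizer
        self.max_avg_len = max_avg_len

    def keep(self, text, context):
        words = get_words(text, context, self.tokenizer)
        if not words:
            return False
        return sum(len(w) for w in words) / len(words) <= self.max_avg_len


def run_unfused(filters, samples):
    # One pass over the data per filter; nothing is shared between filters.
    for f in filters:
        samples = [s for s in samples if f.keep(s, {})]
    return samples


def run_fused(filters, samples):
    # A single pass; all filters share one context per sample.
    kept = []
    for s in samples:
        context = {}
        if all(f.keep(s, context) for f in filters):
            kept.append(s)
    return kept
```

Both runners keep the same samples, but the fused runner tokenizes each sample at most once, whereas the unfused runner tokenizes again in every filter a sample reaches.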

Theoretical Basis

# Abstract algorithm (NOT the real implementation).
# can_fuse, FusedFilter and reorder_by_speed are left abstract.
def fuse_operators(operators, speed_probes=None):
    fused_ops = []
    current_group = []

    def flush(group):
        # Two or more compatible filters become one fused unit;
        # a lone operator is passed through unchanged.
        if len(group) > 1:
            fused_ops.append(FusedFilter(group))
        else:
            fused_ops.extend(group)

    for op in operators:
        if can_fuse(current_group, op):
            current_group.append(op)
        else:
            flush(current_group)
            current_group = [op]

    # Flush remaining group
    flush(current_group)

    # Optionally reorder by probed speed for efficiency
    if speed_probes is not None:
        fused_ops = reorder_by_speed(fused_ops, speed_probes)

    return fused_ops

The fusion criterion checks that operators are all Filters and that consecutive filters share intermediate computation variables.
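One possible reading of this criterion, together with the grouping loop, is sketched below. The `Op` record, its `intermediate_vars` field, and the operator names in the usage note are hypothetical; the rule "the candidate shares at least one variable with any member of the current group" is an assumption, since the text does not say whether sharing is checked against the whole group or only the preceding filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical operator record; not a Data-Juicer class."""
    name: str
    is_filter: bool = False
    intermediate_vars: frozenset = frozenset()


@dataclass
class FusedFilter:
    """Hypothetical container for a group of fused filters."""
    members: list


def can_fuse(group, op):
    # Only Filters are eligible for fusion.
    if not op.is_filter:
        return False
    # A Filter can always start a new group.
    if not group:
        return True
    if not all(member.is_filter for member in group):
        return False
    # Assumed rule: share at least one intermediate variable with the group.
    group_vars = set().union(*(m.intermediate_vars for m in group))
    return bool(group_vars & op.intermediate_vars)


def flush(group, fused):
    # Groups of two or more become a FusedFilter; singletons pass through.
    if len(group) > 1:
        fused.append(FusedFilter(list(group)))
    else:
        fused.extend(group)


def fuse(operators):
    fused, group = [], []
    for op in operators:
        if can_fuse(group, op):
            group.append(op)
        else:
            flush(group, fused)
            group = [op]
    flush(group, fused)
    return fused
```

For a pipeline of a mapper, two filters that both use a `"words"` variable, a filter that uses a different variable, and a final non-filter operator, this sketch fuses only the two `"words"` filters and leaves the other three operators in place.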

Related Pages

Implemented By

Implementation:Datajuicer_Data_juicer_Fuse_Operators
