Principle: Data-Juicer Distributed Pipeline Execution
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Data_Engineering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A distributed execution pattern that applies data processing operators across partitioned data on a Ray cluster with support for convergence points and global operations.
Description
Distributed Pipeline Execution runs Data-Juicer operator pipelines at scale by distributing data across Ray cluster workers. It supports two modes: RayExecutor (basic, streams the entire dataset through operators) and PartitionedRayExecutor (advanced, splits data into partitions that are processed independently with convergence points for global operations like deduplication). The partitioned executor supports DAG-based execution scheduling, checkpointing for fault tolerance, and dynamic partition management.
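The checkpointing mentioned above can be illustrated with a minimal sketch: after each operator, the partition state is materialized to disk so a restarted run resumes from the last completed operator instead of recomputing. This is a pure-Python illustration of the idea; the function name, file layout, and JSON format are assumptions, not Data-Juicer's actual checkpoint mechanism.

```python
# Sketch of per-operator checkpointing for fault tolerance.
# File layout (op_<i>.json) and JSON format are illustrative only.
import json
import os


def run_with_checkpoints(partitions, ops, ckpt_dir="ckpt"):
    """Apply each op to every sample in every partition, checkpointing
    the full partition state after each operator completes."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for i, op in enumerate(ops):
        path = os.path.join(ckpt_dir, f"op_{i}.json")
        if os.path.exists(path):
            # Resume: reload the materialized state instead of recomputing.
            with open(path) as f:
                partitions = json.load(f)
            continue
        partitions = [[op(s) for s in p] for p in partitions]
        with open(path, "w") as f:
            json.dump(partitions, f)
    return partitions
```

On a rerun after a crash, every operator whose checkpoint file exists is skipped, so only the work after the last checkpoint is repeated.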
Usage
Use this principle when processing datasets too large for a single machine, or when operators benefit from GPU parallelism across multiple nodes. Set `executor_type: ray` or `executor_type: ray_partitioned` in the pipeline config.
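A minimal sketch of how the setting might appear in a pipeline config. Aside from `executor_type`, the field names, paths, and operator names below are illustrative and may differ from Data-Juicer's actual config schema:

```yaml
# Illustrative pipeline config; paths and operator names are assumptions
dataset_path: /data/raw.jsonl
export_path: /data/clean.jsonl
executor_type: ray_partitioned   # or: ray (basic streaming mode)
process:
  - whitespace_normalization_mapper:   # per-partition op
  - document_deduplicator:             # global op, triggers a convergence point
```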
Theoretical Basis
# Abstract algorithm (NOT real implementation)

# Basic mode (RayExecutor): stream the whole dataset through every operator
ray_dataset = load_to_ray(dataset_path)
for op in operators:
    ray_dataset = ray_dataset.map(op.process)
ray_dataset.materialize()

# Partitioned mode (PartitionedRayExecutor):
partitions = ray_dataset.split(num_partitions)
for op in operators:
    if op.is_global:  # e.g., deduplicator
        # Convergence: merge partitions, apply globally, re-split
        merged = merge(partitions)
        merged = merged.map(op.process)
        partitions = merged.split(num_partitions)
    else:
        # Independent: process each partition separately
        partitions = [p.map(op.process) for p in partitions]
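The partitioned mode above can be exercised as a pure-Python simulation, with plain lists standing in for Ray partitions. The `Op` class and the `lowercase`/`dedup` operators are illustrative stand-ins, not Data-Juicer's operator API; the point is the control flow: per-partition ops run independently, while a global op forces a merge, a global pass, and a re-split.

```python
# Pure-Python simulation of the partitioned algorithm (no Ray required).
# Op, lowercase, and dedup are illustrative, not Data-Juicer's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Op:
    fn: Callable[[List[dict]], List[dict]]
    is_global: bool = False  # global ops need all partitions at once


def lowercase(samples):
    # per-partition mapper: transforms each sample independently
    return [{**s, "text": s["text"].lower()} for s in samples]


def dedup(samples):
    # global op: correctness requires seeing every sample at once
    seen, out = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            out.append(s)
    return out


def split(samples, n):
    # round-robin partitioning
    parts = [[] for _ in range(n)]
    for i, s in enumerate(samples):
        parts[i % n].append(s)
    return parts


def run_partitioned(samples, ops, num_partitions=2):
    partitions = split(samples, num_partitions)
    for op in ops:
        if op.is_global:
            # convergence point: merge, apply globally, re-split
            merged = [s for p in partitions for s in p]
            merged = op.fn(merged)
            partitions = split(merged, num_partitions)
        else:
            # independent: each partition is processed separately
            partitions = [op.fn(p) for p in partitions]
    return [s for p in partitions for s in p]


data = [{"text": "Hello"}, {"text": "hello"}, {"text": "World"}]
ops = [Op(lowercase), Op(dedup, is_global=True)]
print(run_partitioned(data, ops))  # → [{'text': 'hello'}, {'text': 'world'}]
```

Note that without the convergence point, `"Hello"` and `"hello"` could land in different partitions and a per-partition dedup would miss the duplicate; this is why deduplication is marked `is_global`.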