Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Datajuicer Data juicer Distributed Pipeline Execution

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, Data_Engineering
Last Updated 2026-02-14 17:00 GMT

Overview

A distributed execution pattern that applies data processing operators across partitioned data on a Ray cluster with support for convergence points and global operations.

Description

Distributed Pipeline Execution runs Data-Juicer operator pipelines at scale by distributing data across Ray cluster workers. It supports two modes: RayExecutor (basic, streams the entire dataset through operators) and PartitionedRayExecutor (advanced, splits data into partitions that are processed independently with convergence points for global operations like deduplication). The partitioned executor supports DAG-based execution scheduling, checkpointing for fault tolerance, and dynamic partition management.

Usage

Use this principle when processing datasets too large for a single machine, or when operators benefit from GPU parallelism across multiple nodes. Set executor_type: ray or executor_type: ray_partitioned in the pipeline config.

Theoretical Basis

# Abstract algorithm (NOT real implementation)
# Basic mode (RayExecutor):
ray_dataset = load_to_ray(dataset_path)
for op in operators:
    ray_dataset = ray_dataset.map(op.process)
ray_dataset.materialize()

# Partitioned mode (PartitionedRayExecutor):
partitions = ray_dataset.split(num_partitions)
for op in operators:
    if op.is_global:  # e.g., deduplicator
        # Convergence: merge partitions, apply globally, resplit
        merged = merge(partitions)
        merged = merged.map(op.process)
        partitions = merged.split(num_partitions)
    else:
        # Independent: process each partition separately
        partitions = [p.map(op.process) for p in partitions]

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment