
Principle:Datajuicer Data juicer Distributed Dependency Setup

From Leeroopedia
Knowledge Sources
Domains: Distributed_Computing, DevOps
Last Updated: 2026-02-14 17:00 GMT

Overview

An environment preparation pattern that installs additional packages required for distributed data processing on Ray clusters.

Description

Distributed Dependency Setup ensures that the Python environment has all required packages for Ray-based distributed processing. Data-Juicer defines optional dependency groups (extras) in its package specification that install Ray and related packages. This separation keeps the base installation lightweight while enabling distributed functionality on demand. The ray extra includes core Ray packages, while ray_video adds video deduplication support.

Usage

Apply this principle before running any distributed Ray workflow. It is a one-time setup step per environment, and it is required whenever the pipeline configuration sets executor_type: ray or executor_type: ray_partitioned.
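
A preflight check can make a missing distributed dependency fail early with an actionable message rather than partway through a job. The sketch below is illustrative only and is not part of Data-Juicer: the helper names `missing_modules` and `ensure_distributed_deps` are hypothetical, and it assumes that the two executor types named above are the ones that need the `ray` module to be importable.

```python
import importlib.util

# Assumption: these are the executor types that need Ray, per the Usage text.
RAY_EXECUTOR_TYPES = {"ray", "ray_partitioned"}


def missing_modules(executor_type, required=("ray",)):
    """Return the required top-level modules that cannot be found.

    Hypothetical helper. Only top-level module names are supported,
    because find_spec raises for dotted names with a missing parent.
    """
    if executor_type not in RAY_EXECUTOR_TYPES:
        return []  # Non-distributed executors need no extra packages here.
    return [name for name in required
            if importlib.util.find_spec(name) is None]


def ensure_distributed_deps(executor_type, required=("ray",)):
    """Raise a descriptive error if a distributed dependency is absent."""
    missing = missing_modules(executor_type, required)
    if missing:
        raise RuntimeError(
            f"executor_type={executor_type!r} needs missing modules: "
            f"{', '.join(missing)}. Install the distributed extra first."
        )
```

Calling `ensure_distributed_deps("ray")` before building the pipeline returns silently when Ray is installed and raises `RuntimeError` otherwise. Using `importlib.util.find_spec` avoids actually importing Ray, so the check stays cheap.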

Theoretical Basis

Python packaging extras allow conditional dependency installation:

# Abstract pattern (NOT real implementation)
# Base install: minimal dependencies
pip install data-juicer

# Distributed install: adds ray[data], pydantic
pip install "data-juicer[ray]"

# Video distributed: adds video processing deps
pip install "data-juicer[ray_video]"
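
To see what an extra actually pulls in, the installed package metadata can be inspected: each optional requirement carries an environment marker such as `extra == "ray"`. The following is a minimal sketch using only the standard library. The function names are hypothetical, the marker parsing is deliberately simplified (a full parser such as the one in the `packaging` library handles more cases), and the distribution name passed in is an assumption, since the name published on the package index may differ from the abstract `data-juicer` used above.

```python
import re
from importlib import metadata


def requirements_for_extra(requires, extra):
    """Select requirement specs whose marker names the given extra.

    `requires` is a list of requirement strings in the form returned by
    importlib.metadata.requires(). Simplified: only `extra == "name"`
    markers are recognized.
    """
    pattern = re.compile(rf"""extra\s*==\s*["']{re.escape(extra)}["']""")
    selected = []
    for req in requires or []:
        spec, _, marker = req.partition(";")
        if pattern.search(marker):
            selected.append(spec.strip())
    return selected


def installed_extra_requirements(dist_name, extra):
    """Return the extra's requirements, or None if dist_name is not installed."""
    try:
        requires = metadata.requires(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return requirements_for_extra(requires, extra)
```

For example, `installed_extra_requirements("data-juicer", "ray")` would list the requirements gated behind the `ray` extra if a distribution with that name is installed. Built metadata may record extra names in normalized form (for instance `ray-video` rather than `ray_video`), so it can be worth trying both spellings.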

Related Pages

Implemented By
