Principle:Datajuicer Data juicer Distributed Dependency Setup
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, DevOps |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
An environment preparation pattern that installs additional packages required for distributed data processing on Ray clusters.
Description
Distributed Dependency Setup ensures that the Python environment has all required packages for Ray-based distributed processing. Data-Juicer defines optional dependency groups (extras) in its package specification that install Ray and related packages. This separation keeps the base installation lightweight while enabling distributed functionality on demand. The ray extra includes core Ray packages, while ray_video adds video deduplication support.
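Because extras are recorded in a distribution's metadata, an installed environment can be inspected to see which optional groups a package declares. The following is a minimal sketch using only the standard library; the helper name `declared_extras` is hypothetical, and the distribution name `data-juicer` is taken from the commands on this page and may differ in a real environment.

```python
from importlib import metadata
from typing import List


def declared_extras(dist_name: str) -> List[str]:
    """Return the extras an installed distribution declares, or [] if it is not installed."""
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return []
    # Each "Provides-Extra" metadata header names one optional dependency group.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    # Assumed distribution name; adjust it to match the installed package.
    print(declared_extras("data-juicer"))
```

An empty list means either that the distribution is not installed or that it declares no extras, so the result is a hint rather than proof that the ray extra is available.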
Usage
Use this principle before any distributed Ray workflow. It is a one-time setup step per environment, and it is required when the pipeline configuration sets executor_type: ray or executor_type: ray_partitioned.
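A preflight check can catch a missing setup step before a Ray job is launched. The sketch below is an illustration, not Data-Juicer's own validation logic: the function `missing_modules` and the constant `RAY_EXECUTOR_TYPES` are hypothetical names, and the assumption that only the top-level `ray` module needs to be importable is a simplification.

```python
from importlib.util import find_spec
from typing import List, Sequence

# Executor types that, per the usage notes above, need the distributed extras.
RAY_EXECUTOR_TYPES = {"ray", "ray_partitioned"}


def missing_modules(executor_type: str, required: Sequence[str] = ("ray",)) -> List[str]:
    """List required top-level modules that cannot be found for this executor type."""
    if executor_type not in RAY_EXECUTOR_TYPES:
        return []  # non-Ray executors need nothing extra in this sketch
    # find_spec returns None when a top-level module is not installed.
    return [name for name in required if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules("ray")
    if missing:
        print(f"Missing modules {missing}; install the ray extra first.")
    else:
        print("Ray dependencies look importable.")
```

The check only confirms that the modules can be located; it does not verify versions or that a Ray cluster is reachable.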
Theoretical Basis
Python packaging extras allow conditional dependency installation:
```bash
# Abstract pattern (NOT real implementation)

# Base install: minimal dependencies
pip install data-juicer

# Distributed install: adds ray[data], pydantic
pip install "data-juicer[ray]"

# Video distributed: adds video processing deps
pip install "data-juicer[ray_video]"
```
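When setup is scripted, the same pattern can be expressed as an argument list that runs pip through the current interpreter, which helps ensure the extras land in the environment that will run the pipeline. This is a sketch under the same abstract assumptions as the commands above: the helper `pip_install_command` is hypothetical, and the package and extra names are copied from this page rather than verified against a published release.

```python
import sys
from typing import List, Optional


def pip_install_command(package: str, extra: Optional[str] = None) -> List[str]:
    """Build an argument list that installs `package`, optionally with one extra."""
    requirement = f"{package}[{extra}]" if extra else package
    # Invoking pip as a module of the running interpreter targets this environment.
    return [sys.executable, "-m", "pip", "install", requirement]


if __name__ == "__main__":
    # Print the three variants from the abstract pattern without running them.
    for extra in (None, "ray", "ray_video"):
        print(" ".join(pip_install_command("data-juicer", extra)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to perform the install; passing a list instead of a shell string avoids having to quote the square brackets.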