Implementation:Datajuicer Data juicer Pip Install Ray Extras

From Leeroopedia
Knowledge Sources
Domains: Distributed_Computing, DevOps
Last Updated: 2026-02-14 17:00 GMT

Overview

This page documents how to install Data-Juicer's distributed-processing dependencies using pip extras.

Description

Data-Juicer's pyproject.toml defines optional dependency groups (extras) that install Ray and related packages. The ray extra installs ray[data] and pydantic for distributed data processing. The ray_video extra additionally includes video deduplication dependencies.
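To check which extras an installed distribution actually declares, the standard-library `importlib.metadata` module can read them from the package metadata. The following is a minimal sketch, not part of Data-Juicer itself: the helper names `list_extras` and `filter_by_extra` are hypothetical, and the distribution name `data-juicer` is assumed from the install commands on this page. Extras names may also appear normalized in metadata (for example with hyphens instead of underscores), so the printed list is worth checking before filtering.

```python
from importlib.metadata import PackageNotFoundError, metadata, requires


def list_extras(dist_name: str) -> list[str]:
    """Return the extras declared by an installed distribution ([] if not installed)."""
    try:
        return sorted(metadata(dist_name).get_all("Provides-Extra") or [])
    except PackageNotFoundError:
        return []


def filter_by_extra(requirements: list[str], extra: str) -> list[str]:
    """Keep only requirement strings whose marker names the given extra.

    This is a plain substring match on the marker text, not a full
    PEP 508 parser, so treat the result as a quick inspection aid.
    """
    needles = (f'extra == "{extra}"', f"extra == '{extra}'")
    return [req for req in requirements if any(n in req for n in needles)]


# Assumption: the distribution name matches the one used in the pip commands.
dist = "data-juicer"
print("declared extras:", list_extras(dist))
try:
    print("ray extra pulls in:", filter_by_extra(requires(dist) or [], "ray"))
except PackageNotFoundError:
    print(f"{dist} is not installed in this environment")
```

If the distribution is not installed, both helpers degrade to empty results instead of raising, which keeps the check usable in a fresh environment.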

Usage

Run the appropriate pip install command before using any Ray-based executor. This is a one-time setup step per Python environment.
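Because the install is a prerequisite for any Ray-based executor, a script can check for the modules up front and fail with a clear message rather than an import error deep inside a job. The sketch below is illustrative only: `missing_modules` is a hypothetical helper, and the module list (`ray`, `ray.data`, `pydantic`) is taken from the packages this page says the `ray` extra installs.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the subset of module names that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Modules assumed from the description of the "ray" extra on this page.
absent = missing_modules(["ray", "ray.data", "pydantic"])
if absent:
    print("Missing modules:", ", ".join(absent))
    print('Try: pip install "data-juicer[ray]"')
else:
    print("Ray dependencies found")
```

Note that looking up a dotted name such as `ray.data` imports the parent package as a side effect, so this check is best run once at startup.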

Code Reference

Source Location

  • Repository: data-juicer
  • File: pyproject.toml
  • Lines: L1-234 (extras definitions)

Commands

# Base distributed processing
pip install "data-juicer[ray]"

# With video deduplication support
pip install "data-juicer[ray_video]"

# Full install with all extras
pip install "data-juicer[all]"

Import

# Verify installation
import ray
import ray.data

I/O Contract

Inputs

Name | Type | Required | Description
extras group | str | Yes | Package extras name: 'ray', 'ray_video', or 'all'

Outputs

Name | Type | Description
installed packages | Python packages | ray[data], pydantic, and related dependencies installed into the environment

Usage Examples

Install and Verify

# Install ray extras
pip install "data-juicer[ray]"

# Verify installation
python -c "import ray; print(ray.__version__)"
python -c "from data_juicer.core.executor import RayExecutor; print('OK')"

Related Pages

Implements Principle

Requires Environment
