Implementation:Datajuicer Data juicer Ray Start Cluster

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, Infrastructure
Last Updated 2026-02-14 17:00 GMT

Overview

External tool documentation for starting and configuring a Ray cluster for distributed Data-Juicer processing.

Description

A Ray cluster is started with the ray start CLI, or created and joined from Python with ray.init(). Data-Juicer includes a diagnostic tool at tools/check_ray_cluster.py that verifies cluster connectivity and resource availability. Data-Juicer reads the cluster address from its ray_address config parameter.

Usage

Start a Ray cluster before running any distributed pipeline. Use tools/check_ray_cluster.py to verify the cluster is ready.

Code Reference

Source Location

  • Repository: data-juicer
  • File: tools/check_ray_cluster.py (diagnostic tool)
  • Lines: L1-94

Commands

# Start head node
ray start --head --port=6379 --num-cpus=32 --num-gpus=4

# Start worker node (on each other machine; replace HEAD_IP with the head node's IP)
ray start --address=HEAD_IP:6379

# Verify cluster
python tools/check_ray_cluster.py

# Stop cluster
ray stop
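
As a minimal sketch of how the commands above could be assembled programmatically (for example, by a launcher script), the helpers below build the argument lists for the head and worker nodes. The function names head_command and worker_command are hypothetical and not part of Data-Juicer or Ray; only the flags shown on this page are used.

```python
import shlex
from typing import List, Optional


def head_command(port: int = 6379,
                 num_cpus: Optional[int] = None,
                 num_gpus: Optional[int] = None) -> List[str]:
    """Build the argv for starting a head node (hypothetical helper)."""
    cmd = ["ray", "start", "--head", f"--port={port}"]
    # Resource flags are optional; when omitted, Ray detects resources itself.
    if num_cpus is not None:
        cmd.append(f"--num-cpus={num_cpus}")
    if num_gpus is not None:
        cmd.append(f"--num-gpus={num_gpus}")
    return cmd


def worker_command(head_ip: str, port: int = 6379) -> List[str]:
    """Build the argv for joining a worker to the head node (hypothetical helper)."""
    return ["ray", "start", f"--address={head_ip}:{port}"]


if __name__ == "__main__":
    # Print shell-ready command lines; HEAD_IP is a placeholder.
    print(shlex.join(head_command(num_cpus=32, num_gpus=4)))
    print(shlex.join(worker_command("HEAD_IP")))
```

The returned lists can be passed directly to subprocess.run, which avoids shell quoting issues.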

Python API

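import ray

# Option 1: start a local single-machine Ray instance
ray.init()

# Shut down before connecting elsewhere; calling ray.init() twice
# in the same process raises an error
ray.shutdown()

# Option 2: connect to an existing cluster through Ray Client
# (HEAD_IP is a placeholder for the head node's IP)
ray.init(address='ray://HEAD_IP:10001')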
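
Once connected, the cluster's total resources can be compared against what a pipeline needs. The following is a rough sketch of such a check; it is not the contents of tools/check_ray_cluster.py, and the names missing_resources and check_cluster are hypothetical. It assumes a cluster is already running and reachable with address="auto".

```python
from typing import Dict, Mapping


def missing_resources(available: Mapping[str, float],
                      required: Mapping[str, float]) -> Dict[str, float]:
    """Return the shortfall per resource; an empty dict means all are satisfied."""
    shortfall = {}
    for name, needed in required.items():
        have = available.get(name, 0.0)
        if have < needed:
            shortfall[name] = needed - have
    return shortfall


def check_cluster(required: Mapping[str, float], address: str = "auto") -> bool:
    """Connect to a running cluster and compare its total resources with `required`."""
    import ray  # imported lazily so missing_resources works without Ray installed

    ray.init(address=address)
    try:
        # cluster_resources() reports totals such as {"CPU": 32.0, "GPU": 4.0, ...}
        shortfall = missing_resources(ray.cluster_resources(), required)
    finally:
        ray.shutdown()
    for name, amount in shortfall.items():
        print(f"missing {amount:g} x {name}")
    return not shortfall
```

For example, check_cluster({"CPU": 32, "GPU": 4}) would return True on a cluster started with the head-node command shown earlier on this page.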

I/O Contract

Inputs

Name        Type  Required         Description
--head      flag  Yes (head node)  Designate as head node
--address   str   Yes (workers)    Head node address to connect to
--num-cpus  int   No               CPU resource allocation
--num-gpus  int   No               GPU resource allocation

Outputs

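Name       Type         Description
cluster    Ray runtime  Running Ray cluster; workers join at HEAD_IP:6379, Ray Client connects at ray://HEAD_IP:10001
dashboard  Web UI       Ray dashboard on port 8265 of the head node (bound to localhost unless --dashboard-host is set)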
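
The three addresses that appear on this page (worker join address, Ray Client address, dashboard URL) all derive from the head node's IP. The sketch below collects them in one place; ClusterEndpoints and endpoints_for are hypothetical names, and the port defaults are the ones used in the examples here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEndpoints:
    gcs_address: str     # value for `ray start --address=...` on workers
    client_address: str  # value for ray.init(address=...) through Ray Client
    dashboard_url: str   # Ray dashboard web UI


def endpoints_for(head_ip: str,
                  port: int = 6379,
                  client_port: int = 10001,
                  dashboard_port: int = 8265) -> ClusterEndpoints:
    """Derive the endpoints of a cluster from the head node's IP (hypothetical helper)."""
    return ClusterEndpoints(
        gcs_address=f"{head_ip}:{port}",
        client_address=f"ray://{head_ip}:{client_port}",
        dashboard_url=f"http://{head_ip}:{dashboard_port}",
    )


if __name__ == "__main__":
    print(endpoints_for("192.168.1.100"))
```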

Usage Examples

Single Machine Setup

# Start local cluster and run pipeline
ray start --head
python tools/process_data.py --config ray_pipeline.yaml
ray stop
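
If the pipeline step fails, the three commands above leave the local cluster running. One way to guarantee cleanup is to wrap them in a small driver, sketched here under the assumption that ray and python are on the PATH and that the script is run from the Data-Juicer repository root. The function name run_on_local_cluster and its run parameter are hypothetical, introduced only to make the sketch testable.

```python
import subprocess
from typing import Callable


def run_on_local_cluster(config_path: str,
                         run: Callable[..., object] = subprocess.run) -> None:
    """Start a local head node, run the pipeline, and always stop the cluster."""
    run(["ray", "start", "--head"], check=True)
    try:
        run(["python", "tools/process_data.py", "--config", config_path],
            check=True)
    finally:
        # Runs even when the pipeline raises, so no stray Ray processes remain.
        run(["ray", "stop"], check=True)


if __name__ == "__main__":
    run_on_local_cluster("ray_pipeline.yaml")
```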

Multi-Node Cluster

# On head node (192.168.1.100)
ray start --head --port=6379

# On worker nodes
ray start --address=192.168.1.100:6379

# Verify from any node
python tools/check_ray_cluster.py
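
Worker nodes may take a few seconds to register with the head node, so a check run immediately after startup can see fewer nodes than expected. A possible way to wait for them is sketched below; count_alive and wait_for_nodes are hypothetical helpers, and the sketch assumes node records shaped like the dictionaries returned by ray.nodes(), which carry an "Alive" field.

```python
import time
from typing import Callable, Iterable, Mapping


def count_alive(nodes: Iterable[Mapping]) -> int:
    """Count node records whose "Alive" field is truthy."""
    return sum(1 for node in nodes if node.get("Alive"))


def wait_for_nodes(expected: int,
                   list_nodes: Callable[[], Iterable[Mapping]],
                   timeout_s: float = 60.0,
                   poll_s: float = 2.0) -> bool:
    """Poll `list_nodes` until `expected` nodes are alive or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if count_alive(list_nodes()) >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With Ray installed, this could be used as ray.init(address="auto") followed by wait_for_nodes(3, ray.nodes) for a head node plus two workers.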
