Implementation:Datajuicer Data juicer Ray Start Cluster
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
External tool documentation for starting and configuring a Ray cluster for distributed Data-Juicer processing.
Description
Ray cluster initialization uses the ray start CLI or ray.init() Python API. Data-Juicer includes a diagnostic tool at tools/check_ray_cluster.py that verifies cluster connectivity and resource availability. The cluster address is configured via the ray_address config parameter.
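The kind of check described above can be sketched in a few lines of Python. This is only an illustration of verifying connectivity and resource availability, not the contents of tools/check_ray_cluster.py; the helper names `missing_resources` and `check_cluster` and the `required` thresholds are hypothetical.

```python
from typing import Dict, List, Optional


def missing_resources(available: Dict[str, float],
                      required: Dict[str, float]) -> List[str]:
    """Return readable shortfalls; an empty list means every requirement is met."""
    problems = []
    for name, needed in required.items():
        have = available.get(name, 0.0)
        if have < needed:
            problems.append(f"{name}: need {needed:g}, have {have:g}")
    return problems


def check_cluster(address: str = "auto",
                  required: Optional[Dict[str, float]] = None) -> List[str]:
    """Connect to a running cluster and compare its resources to `required`."""
    import ray  # imported lazily so missing_resources() works without Ray installed

    ray.init(address=address)  # fails if no cluster is reachable at `address`
    try:
        # cluster_resources() reports totals such as {"CPU": 32.0, "GPU": 4.0}
        return missing_resources(ray.cluster_resources(), required or {"CPU": 1})
    finally:
        ray.shutdown()
```

Calling `check_cluster(required={"CPU": 32, "GPU": 4})` on a node of the cluster would return an empty list once the head node from the commands in this page is up with its full allocation.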
Usage
Start a Ray cluster before running any distributed pipeline. Use tools/check_ray_cluster.py to verify the cluster is ready.
Code Reference
Source Location
- Repository: data-juicer
- File: tools/check_ray_cluster.py (diagnostic tool)
- Lines: L1-94
Commands
# Start head node
ray start --head --port=6379 --num-cpus=32 --num-gpus=4
# Start worker node (on other machines)
ray start --address=HEAD_IP:6379
# Verify cluster
python tools/check_ray_cluster.py
# Stop cluster
ray stop
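When the same commands have to be issued from a launcher script, it can help to build them programmatically. The following is a minimal sketch under that assumption; `ray_start_command` is a hypothetical helper, and it only uses the flags shown above.

```python
import shlex
from typing import List, Optional


def ray_start_command(head_address: Optional[str] = None,
                      port: int = 6379,
                      num_cpus: Optional[int] = None,
                      num_gpus: Optional[int] = None) -> List[str]:
    """Build the argument list for `ray start`.

    With no head_address the command starts a head node; otherwise it
    starts a worker that joins the head at head_address:port.
    """
    if head_address is None:
        cmd = ["ray", "start", "--head", f"--port={port}"]
    else:
        cmd = ["ray", "start", f"--address={head_address}:{port}"]
    # Resource flags are optional; Ray detects CPUs and GPUs when they are omitted.
    if num_cpus is not None:
        cmd.append(f"--num-cpus={num_cpus}")
    if num_gpus is not None:
        cmd.append(f"--num-gpus={num_gpus}")
    return cmd


def as_shell(cmd: List[str]) -> str:
    """Render an argument list as a single shell-safe command line."""
    return shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on the target machine, or rendered with `as_shell` for logging.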
Python API
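import ray
# Local single-machine cluster
ray.init()
# Or, instead of the call above, connect to an existing cluster via Ray Client
# (10001 is the default Ray Client server port on the head node)
ray.init(address='ray://HEAD_IP:10001')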
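The address forms used on this page differ in how the driver reaches the cluster, which is easy to mix up because workers join on port 6379 while Ray Client uses port 10001. The sketch below classifies an address string accordingly; the function names and the mode labels are hypothetical and chosen only for illustration.

```python
from typing import Optional


def connection_mode(address: Optional[str]) -> str:
    """Classify an address the way the examples on this page use it."""
    if not address:
        return "default"  # plain ray.init() with no address argument
    if address == "auto":
        return "auto"     # attach to a cluster already running on this machine
    if address.startswith("ray://"):
        return "client"   # Ray Client connection served by the head node
    return "direct"       # HEAD_IP:PORT, the form `ray start --address` takes


def connect(address: Optional[str] = None):
    """Call ray.init() with or without an address, depending on what was given."""
    import ray  # imported lazily so connection_mode() works without Ray installed

    return ray.init(address=address) if address else ray.init()
```

A value read from the ray_address config parameter could be passed through `connection_mode` first to log which kind of connection is about to be made.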
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
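| --head | flag | Yes (head node) | Designates this node as the cluster head |
| --port | int | No | Port the head node listens on for workers (6379, Ray's default, in the commands above) |
| --address | str | Yes (workers) | Address of the head node to join, in the form HEAD_IP:PORT |
| --num-cpus | int | No | Number of CPUs this node offers to Ray; detected automatically if omitted |
| --num-gpus | int | No | Number of GPUs this node offers to Ray; detected automatically if omitted |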
Outputs
| Name | Type | Description |
|---|---|---|
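| cluster | Ray runtime | Running Ray cluster; workers join at HEAD_IP:6379 and Ray Client drivers connect via ray://HEAD_IP:10001 |
| dashboard | Web UI | Ray dashboard on port 8265 of the head node (bound to localhost by default; start the head with --dashboard-host=0.0.0.0 to reach it at HEAD_IP:8265) |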
Usage Examples
Single Machine Setup
# Start local cluster and run pipeline
ray start --head
python tools/process_data.py --config ray_pipeline.yaml
ray stop
Multi-Node Cluster
# On head node (192.168.1.100)
ray start --head --port=6379
# On worker nodes
ray start --address=192.168.1.100:6379
# Verify from any node
python tools/check_ray_cluster.py
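In a multi-node setup the workers may take a moment to register with the head node, so a pipeline launched immediately can see fewer nodes than expected. A minimal polling sketch, assuming the node list comes from `ray.nodes()` (whose entries carry an "Alive" field); `alive_node_count` and `wait_for_nodes` are hypothetical helpers, not part of Data-Juicer.

```python
import time
from typing import Callable, Dict, List


def alive_node_count(nodes: List[Dict]) -> int:
    """Count entries whose "Alive" field is truthy."""
    return sum(1 for node in nodes if node.get("Alive"))


def wait_for_nodes(expected: int,
                   list_nodes: Callable[[], List[Dict]],
                   timeout_s: float = 60.0,
                   poll_s: float = 2.0) -> bool:
    """Poll list_nodes() until `expected` nodes are alive or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if alive_node_count(list_nodes()) >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

After `ray.init(address="auto")` on any node, `wait_for_nodes(3, ray.nodes)` would return True once the head and two workers have registered, or False after the timeout.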