
Environment: NVIDIA NeMo Curator Ray Cluster

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Distributed_Computing
Last Updated 2026-02-14 16:45 GMT

Overview

Ray distributed computing cluster environment with Xenna executor integration for running NeMo Curator pipelines at scale.

Description

NeMo Curator uses Ray as its primary distributed execution backend. The `RayClient` manages cluster connections, and the Cosmos-Xenna framework (via `XennaExecutor`) provides the default executor for pipeline stages. The cluster can run locally (single-node) or across multiple nodes. NeMo Curator automatically configures Ray environment variables at import time to ensure compatibility with the Xenna executor.
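The import-time configuration described above can be sketched with plain `os.environ` calls. This is a minimal illustration of the pattern, not NeMo Curator's actual code: the `API_LIMIT` value here is a stand-in placeholder, since the real value is imported from `cosmos_xenna.ray_utils.cluster`.

```python
import os

# Stand-in for cosmos_xenna.ray_utils.cluster.API_LIMIT; the real value
# is imported by NeMo Curator at package import time, not hard-coded.
API_LIMIT = 10_000

def configure_ray_env(limit: int = API_LIMIT) -> dict[str, str]:
    """Mirror the Ray env vars NeMo Curator sets for Xenna compatibility."""
    values = {
        "RAPIDS_NO_INITIALIZE": "1",  # defer RAPIDS/CUDA initialization
        "RAY_MAX_LIMIT_FROM_API_SERVER": str(limit),
        "RAY_MAX_LIMIT_FROM_DATA_SOURCE": str(limit),
    }
    os.environ.update(values)
    return values
```

Because these variables must be in place before Ray workers start, NeMo Curator applies them at import time rather than leaving it to the caller.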

Usage

Required for all pipeline execution in NeMo Curator. Even single-node usage initializes a local Ray cluster. Multi-node deployments require a pre-configured Ray cluster with the head node accessible via `RAY_ADDRESS`.
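The connect-or-start decision can be sketched as follows; `resolve_ray_target` is a hypothetical helper, not part of the NeMo Curator API, and the returned value is what you would hand to `ray.init(...)`.

```python
import os

def resolve_ray_target() -> str:
    """Return the address of an existing cluster, or 'local' to start one."""
    address = os.environ.get("RAY_ADDRESS")
    if address:
        # e.g. ray://head-node:10001 -> join the pre-configured cluster
        return address
    # No address set: single-node usage starts a local Ray cluster.
    return "local"
```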

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux | NeMo Curator requires Ray running on Linux |
| Network | Open ports: 6379, 8265, 8080, 10001-19999 | Ray head, dashboard, metrics, and worker ports |
| Memory | 8 GB+ RAM per node | Ray object store uses shared memory |
| Disk | `/tmp/ray` writable | Default Ray temp directory |
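Because the Ray object store is backed by shared memory, it is worth checking the size of the `/dev/shm` mount before sizing a node. A small stdlib sketch (the `free_gib` helper is illustrative, not part of NeMo Curator):

```python
import shutil

def free_gib(path: str = "/dev/shm") -> float:
    """Free space in GiB on the mount backing Ray's object store."""
    return shutil.disk_usage(path).free / 2**30
```

On container platforms `/dev/shm` often defaults to 64 MB, which is far below what Ray needs; pass `--shm-size` to Docker in that case.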

Dependencies

Python Packages

  • `ray[default,data]` >= 2.50
  • `cosmos-xenna` == 0.1.2

Environment Variables

The following environment variables configure the Ray cluster:

  • `RAY_ADDRESS`: Address of the Ray head node (e.g., `ray://head-node:10001`). If set, NeMo Curator connects to an existing cluster instead of starting a local one.
  • `CURATOR_IGNORE_RAY_HEAD_NODE`: Optional flag to ignore Ray head node scheduling constraints.
  • `RAPIDS_NO_INITIALIZE`: Set to `1` automatically by NeMo Curator to prevent premature RAPIDS initialization.
  • `RAY_MAX_LIMIT_FROM_API_SERVER`: Set automatically from Cosmos-Xenna API_LIMIT.
  • `RAY_MAX_LIMIT_FROM_DATA_SOURCE`: Set automatically from Cosmos-Xenna API_LIMIT.

Quick Install

# Ray is included in the base nemo-curator install
pip install nemo-curator

# To start a local Ray cluster manually:
ray start --head --port=6379 --dashboard-host=0.0.0.0

Code Evidence

Default port configuration from `nemo_curator/core/constants.py:15-26`:

DEFAULT_RAY_PORT = 6379
DEFAULT_RAY_DASHBOARD_PORT = 8265
DEFAULT_RAY_TEMP_DIR = "/tmp/ray"
DEFAULT_RAY_METRICS_PORT = 8080
DEFAULT_RAY_DASHBOARD_HOST = "127.0.0.1"
DEFAULT_RAY_CLIENT_SERVER_PORT = 10001
DEFAULT_RAY_AUTOSCALER_METRIC_PORT = 44217
DEFAULT_RAY_DASHBOARD_METRIC_PORT = 44227

# We cannot use a free port between 10000 and 19999 as it is used by Ray.
DEFAULT_RAY_MIN_WORKER_PORT = 10002
DEFAULT_RAY_MAX_WORKER_PORT = 19999

RAY_ADDRESS detection from `nemo_curator/core/client.py:119`:

# Check if Ray is already running via RAY_ADDRESS env var
ray_address = os.environ.get("RAY_ADDRESS")

Automatic env var configuration from `nemo_curator/__init__.py:34-38`:

from cosmos_xenna.ray_utils.cluster import API_LIMIT
os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = str(API_LIMIT)
os.environ["RAY_MAX_LIMIT_FROM_DATA_SOURCE"] = str(API_LIMIT)

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `ConnectionRefusedError` on Ray client connect | Ray head node not running | Start Ray with `ray start --head` or set `RAY_ADDRESS` |
| `RAY_ADDRESS already set in environment` (warning) | Conflicting Ray address configuration | Clear `RAY_ADDRESS` or ensure it points to the correct cluster |
| Port conflict on 6379 | Another service (commonly Redis) using the Ray default port | Change the Ray port with the `--port` flag |
| Ray object store OOM | Insufficient shared memory | Increase `/dev/shm` size or set `--object-store-memory` |
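The port-conflict row can be diagnosed before starting Ray with a quick stdlib probe; `port_in_use` is an illustrative helper, not part of Ray or NeMo Curator.

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something already listens on the given TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on a successful connection (port occupied)
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(6379)` returning `True` before `ray start --head` suggests another service holds the default Ray port.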

Compatibility Notes

  • Port range 10000-19999: Reserved by Ray for worker communication. Do not bind other services to these ports.
  • Single-node mode: NeMo Curator auto-starts a local Ray cluster if no `RAY_ADDRESS` is set.
  • Multi-node: All nodes must have the same NeMo Curator version and compatible RAPIDS/CUDA stack.
  • Dashboard: Accessible at `http://localhost:8265` by default (`DEFAULT_RAY_DASHBOARD_HOST` is `127.0.0.1`). Pass `--dashboard-host` to `ray start` to expose it on other interfaces.
