Principle:Huggingface Datatrove Ray Distributed Execution
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Ray Distributed Execution is the principle of distributing data processing pipeline tasks across a Ray cluster using placement groups, fault-tolerant task scheduling, and automatic retry mechanisms.
Description
In large-scale data processing, workloads must be distributed across many machines to achieve reasonable throughput. Ray provides a framework for distributed computing that abstracts away the complexity of resource management, task scheduling, and inter-node communication. The Ray Distributed Execution principle in datatrove leverages Ray's placement groups to ensure that tasks receive the exact resource allocation they need (CPUs, GPUs, memory) and that multi-node tasks are co-located on the correct set of machines.
A key aspect of this principle is fault tolerance. In cloud environments, worker nodes may be preempted or crash unexpectedly. The execution model handles this through retriable error classification (distinguishing Ray-level failures like preemption, actor crashes, and object loss from application-level errors) and automatic resubmission of failed tasks up to a configurable limit. Task timeouts provide an additional safety net, terminating and rescheduling tasks that exceed expected runtime.
Usage
Apply this principle when building distributed data pipelines that need to run reliably on elastic or spot-instance-based Ray clusters, where fault tolerance, resource management, and multi-node coordination are required.
Theoretical Basis
The Ray Distributed Execution approach in datatrove is built on several key concepts:
- Placement Groups: Ray's mechanism for reserving bundles of resources across nodes. Each task group reserves
nodes_per_taskbundles, ensuring resource isolation and co-location for multi-node workloads.
- Actor-Based Execution: Pipeline tasks run inside Ray actors (RankWorker), which maintain state across method calls and can be placed on specific nodes within a placement group. Each actor spawns a local process pool for running multiple ranks.
- Asynchronous Task Management: Placement group creation and task dispatch happen asynchronously via a ThreadPoolExecutor, preventing the main scheduling loop from blocking while waiting for resource allocation.
- Retriable Error Classification: The system distinguishes between retriable errors (WorkerCrashedError, TaskCancelledError, ObjectLostError, RayActorError with preemption) and non-retriable application errors, only resubmitting tasks that failed due to infrastructure issues.
- Dependency Chaining: Executors can declare dependencies on other executors via the
dependsparameter, enabling multi-stage pipeline orchestration where later stages wait for earlier stages to complete.