Implementation: Alibaba ROLL Cluster
| Knowledge Sources | Details |
|---|---|
| Domains | Distributed_Systems, GPU_Computing |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete distributed cluster manager for Ray-based GPU worker groups provided by the Alibaba ROLL library.
Description
The Cluster class manages a group of Ray actors representing workers of the same role. It handles worker creation with placement group scheduling, method binding for distributed dispatch, and provides synchronous/asynchronous execution interfaces. Combined with ResourceManager for GPU allocation, it forms the distributed execution backbone of all ROLL pipelines.
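The role-group dispatch pattern described above can be sketched in plain Python. This is an illustrative stand-in only (the names `WorkerStub` and `MiniCluster` are hypothetical, and the real Cluster wraps Ray remote actors with placement-group scheduling rather than local objects):

```python
# Minimal sketch of same-role worker-group dispatch (illustrative only;
# the actual Cluster manages Ray actors, not in-process objects).
from typing import Any, List

class WorkerStub:
    """Hypothetical stand-in for a Ray actor worker."""
    def __init__(self, rank: int):
        self.rank = rank

    def ping(self, tag: str) -> str:
        return f"{tag}:{self.rank}"

class MiniCluster:
    def __init__(self, name: str, world_size: int):
        self.name = name
        self.workers: List[WorkerStub] = [WorkerStub(r) for r in range(world_size)]

    def execute_all_sync(self, method_name: str, *args, **kwargs) -> List[Any]:
        # Call the named method on every worker, gathering results in rank order.
        return [getattr(w, method_name)(*args, **kwargs) for w in self.workers]

    def execute_rank_zero_sync(self, method_name: str, *args, **kwargs) -> Any:
        # Call the named method on the rank-0 worker only.
        return getattr(self.workers[0], method_name)(*args, **kwargs)

cluster = MiniCluster("actor_train", world_size=4)
print(cluster.execute_all_sync("ping", "hb"))       # ['hb:0', 'hb:1', 'hb:2', 'hb:3']
print(cluster.execute_rank_zero_sync("ping", "hb")) # hb:0
```

Method binding in the real class follows the same idea: a method name is resolved per worker and dispatched either to all ranks or to rank 0 alone.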
Usage
Import this class when initializing distributed workers for any ROLL pipeline. Each pipeline creates multiple Cluster instances for different roles (actor_train, actor_infer, reference, critic, reward).
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/distributed/executor/cluster.py
- Lines: L32-265
Signature
```python
class Cluster:
    def __init__(
        self,
        name: str,
        worker_cls: Union[Type[Worker], str],
        resource_manager: ResourceManager,
        worker_config: WorkerConfig,
    ) -> None:
        """
        Initialize Cluster with Ray actor workers.

        Args:
            name: Cluster name (e.g., 'actor_train', 'reward-math')
            worker_cls: Worker class or string path for Ray remote actors
            resource_manager: GPU/CPU resource allocation manager
            worker_config: Configuration for all workers in this cluster
        """

    def execute_rank_zero_sync(self, method_name: str, *args, **kwargs) -> Any:
        """Execute method on rank 0 worker synchronously."""

    def execute_rank_zero_async(self, method_name: str, *args, **kwargs) -> Any:
        """Execute method on rank 0 worker asynchronously."""

    def execute_all_sync(self, method_name: str, *args, **kwargs) -> List[Any]:
        """Execute method on all workers synchronously."""

    def execute_all_async(self, method_name: str, *args, **kwargs) -> List[Any]:
        """Execute method on all workers asynchronously."""
```
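The sync/async split mirrors the futures model Ray uses: an asynchronous call dispatches work and returns handles immediately, while a synchronous call blocks until results arrive. A rough standard-library analogy (using `concurrent.futures` rather than Ray ObjectRefs, so function names and mechanics here are illustrative, not the library's implementation):

```python
# Sketch of sync vs. async dispatch semantics using thread futures.
# The real Cluster returns Ray ObjectRefs from its async methods.
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

def slow_square(x: int) -> int:
    return x * x

pool = ThreadPoolExecutor(max_workers=4)

def execute_all_async(fn: Callable[[int], int], ranks: List[int]) -> List[Future]:
    # "Async": submit work for every rank and return futures without blocking.
    return [pool.submit(fn, r) for r in ranks]

def execute_all_sync(fn: Callable[[int], int], ranks: List[int]) -> List[int]:
    # "Sync": submit, then block until every per-rank result is available.
    return [f.result() for f in execute_all_async(fn, ranks)]

futures = execute_all_async(slow_square, [0, 1, 2, 3])  # returns immediately
print(execute_all_sync(slow_square, [0, 1, 2, 3]))      # [0, 1, 4, 9]
```

The async variants let a pipeline overlap work across roles (e.g., start inference while training finishes) and gather results later.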
Import
```python
from roll.distributed.executor.cluster import Cluster
from roll.distributed.scheduler.resource_manager import ResourceManager
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Cluster name identifier |
| worker_cls | Union[Type[Worker], str] | Yes | Worker class or fully qualified string path |
| resource_manager | ResourceManager | Yes | GPU resource allocation manager |
| worker_config | WorkerConfig | Yes | Worker configuration with device_mapping, strategy_args |
Outputs
| Name | Type | Description |
|---|---|---|
| Cluster instance | Cluster | Initialized cluster with Ray actor workers deployed on assigned GPUs |
| dp_size | int | Data parallel size of the cluster |
| tp_size | int | Tensor parallel size of the cluster |
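The `dp_size` and `tp_size` outputs typically partition the cluster's workers, with world_size = dp_size × tp_size. A sketch of that arithmetic (the helper name and the assumption that `device_mapping` lists one entry per GPU are mine, not confirmed by the source):

```python
from typing import List, Tuple

def parallel_sizes(device_mapping: List[int], tp_size: int) -> Tuple[int, int]:
    """Derive the data-parallel size from total GPUs and tensor-parallel degree.

    Assumes (hypothetically) that the worker group's world size equals
    len(device_mapping) and factors as world_size = dp_size * tp_size.
    """
    world_size = len(device_mapping)
    assert world_size % tp_size == 0, "tp_size must divide world_size"
    return world_size // tp_size, tp_size

dp, tp = parallel_sizes(device_mapping=list(range(8)), tp_size=2)
print(dp, tp)  # 4 2
```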
Usage Examples
Creating Actor Training Cluster
```python
from roll.distributed.executor.cluster import Cluster
from roll.distributed.scheduler.resource_manager import ResourceManager

# 1. Create resource manager for the Ray cluster
resource_manager = ResourceManager(num_gpus_per_node=8, num_nodes=4)

# 2. Create actor training cluster
actor_train = Cluster(
    name="actor_train",
    worker_cls="roll.pipeline.rlvr.actor_worker.ActorWorker",
    resource_manager=resource_manager,
    worker_config=rlvr_config.actor_train,
)

# 3. Initialize all workers
actor_train.execute_all_sync("initialize", rlvr_config)

# 4. Execute training step on all workers
results = actor_train.execute_all_sync("train_step", batch)
```
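Before a call like `execute_all_sync("train_step", batch)`, pipelines commonly split the global batch across data-parallel ranks. The helper below is a hypothetical companion sketch, not part of the Cluster API:

```python
from typing import List, Sequence

def shard_batch(batch: Sequence, dp_size: int) -> List[Sequence]:
    """Split a batch into dp_size near-equal shards, one per data-parallel rank.

    Earlier ranks receive one extra sample when len(batch) is not divisible
    by dp_size, so shard sizes differ by at most one.
    """
    base, rem = divmod(len(batch), dp_size)
    shards, start = [], 0
    for rank in range(dp_size):
        size = base + (1 if rank < rem else 0)
        shards.append(batch[start:start + size])
        start += size
    return shards

print(shard_batch(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```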
Related Pages
Implements Principle
Requires Environment
Environment Dependencies
This implementation requires the following environment constraints:
- Environment:Alibaba_ROLL_CUDA_GPU_Environment
- Environment:Alibaba_ROLL_ROCm_GPU_Environment
- Environment:Alibaba_ROLL_Ascend_NPU_Environment
- Environment:Alibaba_ROLL_Python_Runtime_Environment
Heuristics Applied
This implementation uses the following heuristics: