Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Apache Dolphinscheduler Cluster Health Monitoring

From Leeroopedia


Knowledge Sources
Domains Distributed_Systems, Cluster_Management
Last Updated 2026-02-10 00:00 GMT

Overview

A registry-based cluster health monitoring system that tracks master and worker node availability through heartbeat subscriptions, maintaining a live topology map with slot-based work distribution.

Description

The Cluster Health Monitoring principle defines how DolphinScheduler maintains awareness of all nodes in the distributed cluster. MasterClusters and WorkerClusters subscribe to a service registry (e.g., ZooKeeper) for heartbeat events. When nodes join, leave, or update their status, the cluster objects update their internal maps and notify registered listeners. MasterSlotManager uses the master cluster topology to assign work slots, ensuring even distribution of workflow processing across master instances.

Worker clusters additionally maintain a worker-group-to-address mapping, enabling the dispatcher to find available workers for each group. The ClusterManager orchestrates the initialization of both cluster objects at startup.

Usage

Cluster health monitoring is automatically initialized by the master server via ClusterManager.start(). It runs continuously in the background, processing registry events and maintaining the live topology.

Theoretical Basis

The monitoring follows the Observer Pattern over a Service Registry:

  • Registry: External service (ZooKeeper) stores node heartbeats
  • Subscriber: AbstractClusterSubscribeListener receives ADD/REMOVE/UPDATE events
  • Topology Map: MasterClusters and WorkerClusters maintain live node sets
  • Slot Management: MasterSlotManager assigns slots based on master count
// Monitoring flow
RegistryClient.subscribe("/masters", masterCluster)
RegistryClient.subscribe("/workers", workerCluster)

// On heartbeat event:
onServerAdded(metadata):    // node joined cluster
    serverMap.put(address, metadata)
    notifyListeners(ADD)
    rebalanceSlots()

onServerRemove(metadata):   // node left/crashed
    serverMap.remove(address)
    notifyListeners(REMOVE)
    rebalanceSlots()
    triggerFailover(metadata)

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment