Principle: Apache DolphinScheduler Cluster Health Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Cluster_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A registry-based cluster health monitoring system that tracks master and worker node availability through heartbeat subscriptions, maintaining a live topology map with slot-based work distribution.
Description
The Cluster Health Monitoring principle defines how DolphinScheduler maintains awareness of all nodes in the distributed cluster. MasterClusters and WorkerClusters subscribe to a service registry (e.g., ZooKeeper) for heartbeat events. When nodes join, leave, or update their status, the cluster objects update their internal maps and notify registered listeners. MasterSlotManager uses the master cluster topology to assign work slots, ensuring even distribution of workflow processing across master instances.
Worker clusters additionally maintain a worker-group-to-address mapping, enabling the dispatcher to find available workers for each group. The ClusterManager orchestrates the initialization of both cluster objects at startup.
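As an illustrative sketch of this mechanism, the following class maintains a live address-to-metadata map and fans events out to registered listeners. The class and method names here (`ClusterTopology`, `ClusterChangeListener`) are simplified stand-ins, not the actual DolphinScheduler API:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical simplification of a cluster subscribe listener: it keeps a
// live address -> metadata map and notifies listeners on every change.
public class ClusterTopology {

    public interface ClusterChangeListener {
        void onChange(String eventType, String address);
    }

    private final Map<String, String> serverMap = new ConcurrentHashMap<>();
    private final List<ClusterChangeListener> listeners = new CopyOnWriteArrayList<>();

    public void registerListener(ClusterChangeListener listener) {
        listeners.add(listener);
    }

    // Invoked by the registry subscription when a node's heartbeat appears.
    public void onServerAdded(String address, String metadata) {
        serverMap.put(address, metadata);
        notifyListeners("ADD", address);
    }

    // Invoked when a node's heartbeat path disappears (node left or crashed).
    public void onServerRemove(String address) {
        serverMap.remove(address);
        notifyListeners("REMOVE", address);
    }

    public int aliveServerCount() {
        return serverMap.size();
    }

    private void notifyListeners(String eventType, String address) {
        for (ClusterChangeListener listener : listeners) {
            listener.onChange(eventType, address);
        }
    }
}
```

The concurrent collections matter here: registry callbacks arrive on the registry client's watcher thread while dispatchers read the map from their own threads, so the map must tolerate concurrent reads during updates.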
Usage
Cluster health monitoring is automatically initialized by the master server via ClusterManager.start(). It runs continuously in the background, processing registry events and maintaining the live topology.
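A minimal sketch of that startup wiring, assuming a stand-in `RegistryClient` interface (the real client subscribes to ZooKeeper paths; the interfaces below are illustrative, not DolphinScheduler's signatures):

```java
// Stand-in for the real registry client, which subscribes a listener to a
// registry path and delivers heartbeat events for children of that path.
interface RegistryClient {
    void subscribe(String path, SubscribeListener listener);
}

interface SubscribeListener {
    void onEvent(String type, String path, String data);
}

// Hypothetical simplification of ClusterManager: start() subscribes the
// master and worker cluster objects to their registry paths, after which
// both topology maps are kept current by incoming heartbeat events.
class ClusterManager {
    private final RegistryClient registryClient;
    private final SubscribeListener masterClusters;
    private final SubscribeListener workerClusters;

    ClusterManager(RegistryClient registryClient,
                   SubscribeListener masterClusters,
                   SubscribeListener workerClusters) {
        this.registryClient = registryClient;
        this.masterClusters = masterClusters;
        this.workerClusters = workerClusters;
    }

    void start() {
        registryClient.subscribe("/masters", masterClusters);
        registryClient.subscribe("/workers", workerClusters);
    }
}
```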
Theoretical Basis
The monitoring follows the Observer Pattern over a Service Registry:
- Registry: External service (ZooKeeper) stores node heartbeats
- Subscriber: AbstractClusterSubscribeListener receives ADD/REMOVE/UPDATE events
- Topology Map: MasterClusters and WorkerClusters maintain live node sets
- Slot Management: MasterSlotManager assigns slots based on master count
```
// Monitoring flow: both cluster objects subscribe to their registry paths
RegistryClient.subscribe("/masters", masterCluster)
RegistryClient.subscribe("/workers", workerCluster)

// On heartbeat event:
onServerAdded(metadata):      // node joined the cluster
    serverMap.put(address, metadata)
    notifyListeners(ADD)
    rebalanceSlots()

onServerRemove(metadata):     // node left or crashed
    serverMap.remove(address)
    notifyListeners(REMOVE)
    rebalanceSlots()
    triggerFailover(metadata)
```
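The slot rebalancing step can be sketched as follows. Under the assumption that each master derives its slot from its index in the sorted list of alive master addresses, and that a workflow instance belongs to the master whose slot equals the instance id modulo the total slot count, every master independently computes the same partitioning. The `SlotCalculator` class below is an illustrative simplification, not the actual `MasterSlotManager` code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch: slots are re-derived from the sorted list of alive
// master addresses, so all masters agree on the mapping without coordination.
class SlotCalculator {

    // Returns this master's slot index, or -1 if it is not in the alive list.
    static int slotOf(String selfAddress, List<String> aliveMasters) {
        List<String> sorted = new ArrayList<>(aliveMasters);
        Collections.sort(sorted);
        return sorted.indexOf(selfAddress);
    }

    // A workflow instance is processed by this master iff its id maps to our slot.
    static boolean ownsWorkflow(long workflowInstanceId, int slot, int totalSlots) {
        return totalSlots > 0 && slot >= 0
                && workflowInstanceId % totalSlots == slot;
    }
}
```

Because the mapping is a pure function of the sorted alive-master list, a node joining or leaving changes `totalSlots` and shifts ownership deterministically on every surviving master at once, which is why `rebalanceSlots()` runs on both ADD and REMOVE events.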