Principle:Apache Dolphinscheduler Node Failure Detection
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Fault_Tolerance |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A heartbeat-based mechanism that detects node failures through registry event subscriptions and triggers failover for the affected workflows and tasks.
Description
The Node Failure Detection principle uses a service registry's session-based heartbeat mechanism to detect when nodes become unavailable. When a master or worker node stops sending heartbeats (because of a crash, a network partition, or a graceful shutdown), its registry session ends and the registry fires a REMOVE event. The AbstractClusterSubscribeListener receives this event, parses the node metadata from the heartbeat JSON, and invokes onServerRemove() on the appropriate cluster object (MasterClusters or WorkerClusters). That cluster object then notifies every registered IClustersChangeListener instance, which triggers the failover procedures.
Usage
Failure detection is automatic and requires no application-level configuration beyond the registry's session timeout. The registry client subscription is set up during cluster initialization.
Theoretical Basis
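The detection follows the Unreliable Failure Detector model, in which a node that misses its heartbeats is suspected of having failed even though it may only be slow or partitioned from the registry: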
- Heartbeat: Nodes periodically register heartbeats with the registry
- Session Timeout: The registry detects the absence of heartbeats after a configurable timeout
- REMOVE Event: Fired by the registry when the node's session expires
- Listener Notification: All registered listeners are notified of the failure
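The heartbeat and session-timeout steps above can be illustrated with a minimal in-memory sketch. It is not DolphinScheduler or registry code: the class `SessionTimeoutDetector` and its methods are hypothetical names, and timestamps are passed in explicitly so that the behaviour is deterministic. A real registry tracks sessions on the server side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a registry's session tracking; for illustration only. */
class SessionTimeoutDetector {
    private final long sessionTimeoutMillis;
    // Timestamp of the most recent heartbeat, keyed by node address.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    SessionTimeoutDetector(long sessionTimeoutMillis) {
        this.sessionTimeoutMillis = sessionTimeoutMillis;
    }

    /** Records a heartbeat, which renews the node's session. */
    void heartbeat(String nodeAddress, long nowMillis) {
        lastHeartbeat.put(nodeAddress, nowMillis);
    }

    /**
     * Expires every session that has been silent for longer than the timeout
     * and returns the affected nodes. Each returned node corresponds to one
     * REMOVE event in the model described above.
     */
    List<String> expireSessions(long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > sessionTimeoutMillis) {
                removed.add(entry.getKey());
                it.remove(); // the session is gone until the node heartbeats again
            }
        }
        return removed;
    }
}
```

Because the detector only observes silence, it cannot tell a crashed node from one that is slow or partitioned, which is why the model is called unreliable. A shorter timeout detects failures sooner but raises the chance of a false suspicion.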
// Detection flow
notify(event):
    if event.type == REMOVE:
        metadata = parseServerFromHeartbeat(event.data)
        onServerRemove(metadata)
        for listener in changeListeners:
            listener.onServerRemove(metadata)
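
The detection flow can also be written as a small Java sketch. All types here (`EventType`, `RegistryEvent`, `ClusterChangeListener`, `ClusterSubscribeListener`) are simplified, hypothetical stand-ins and not the actual DolphinScheduler classes. The sketch also assumes the heartbeat payload is a bare server address, whereas the real payload is JSON.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified types for illustration; not the real DolphinScheduler API.
enum EventType { ADD, UPDATE, REMOVE }

final class RegistryEvent {
    final EventType type;
    final String data; // heartbeat payload; assumed here to be a bare address

    RegistryEvent(EventType type, String data) {
        this.type = type;
        this.data = data;
    }
}

interface ClusterChangeListener {
    void onServerRemove(String serverAddress);
}

class ClusterSubscribeListener {
    private final List<ClusterChangeListener> changeListeners = new ArrayList<>();
    private final Set<String> liveServers = new HashSet<>();

    void registerListener(ClusterChangeListener listener) {
        changeListeners.add(listener);
    }

    boolean isAlive(String serverAddress) {
        return liveServers.contains(serverAddress);
    }

    /** Entry point for registry events; plays the role of notify(event) in the pseudocode. */
    void onEvent(RegistryEvent event) {
        if (event.type == EventType.ADD) {
            liveServers.add(parseServerFromHeartbeat(event.data));
        } else if (event.type == EventType.REMOVE) {
            String server = parseServerFromHeartbeat(event.data);
            liveServers.remove(server);            // update the cluster view first
            for (ClusterChangeListener listener : changeListeners) {
                listener.onServerRemove(server);   // then fan out to failover handlers
            }
        }
    }

    // Assumption: the payload is just the address; the real system parses JSON metadata.
    private String parseServerFromHeartbeat(String data) {
        return data.trim();
    }
}
```

A failover component would register itself with `registerListener` during initialization and start recovering the removed node's workflows or tasks inside `onServerRemove`. Updating the cluster view before notifying listeners means a listener that queries the live-server set never sees the failed node.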