Principle:Apache Dolphinscheduler Node Failure Detection

Knowledge Sources	Apache DolphinScheduler Failure Detection
Domains	Distributed_Systems, Fault_Tolerance
Last Updated	2026-02-10 00:00 GMT

Overview

Node failure detection in Apache DolphinScheduler is heartbeat-based: the cluster subscribes to registry events, learns that a node is gone when the registry removes that node's entry, and triggers failover for the affected workflows and tasks.

Description

The Node Failure Detection principle uses a service registry's session-based heartbeat mechanism to detect when nodes become unavailable. When a master or worker node stops sending heartbeats (because of a crash, a network partition, or a graceful shutdown), its registry session ends and the registry fires a REMOVE event. The AbstractClusterSubscribeListener receives this event, parses the node metadata from the heartbeat JSON, and invokes onServerRemove() on the appropriate cluster object (MasterClusters or WorkerClusters). All registered IClustersChangeListener instances are then notified, which triggers the failover procedures.

Usage

Failure detection is automatic and requires no application-level configuration beyond the registry's session timeout. The registry client subscription is set up during cluster initialization.

Theoretical Basis

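The detection follows the Unreliable Failure Detector model: a missed heartbeat lets the registry suspect a node, but a timeout alone cannot distinguish a crashed node from a slow or partitioned one. Detection proceeds in four steps: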

Heartbeat: Nodes periodically register heartbeats with the registry.
Session Timeout: The registry detects the absence of heartbeats after a configurable timeout.
REMOVE Event: The registry fires a REMOVE event when the node's session expires.
Listener Notification: All registered listeners are notified of the failure.
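
The heartbeat and session-timeout steps above can be sketched as a small in-memory detector. This is only an illustration of the timeout idea, not the registry's actual implementation: the class SessionTimeoutDetector and its methods heartbeat and expireSessions are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not a DolphinScheduler or registry class.
class SessionTimeoutDetector {
    private final long sessionTimeoutMillis;
    // Last heartbeat time (in milliseconds) seen for each node id.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    SessionTimeoutDetector(long sessionTimeoutMillis) {
        this.sessionTimeoutMillis = sessionTimeoutMillis;
    }

    // Records a heartbeat from the given node at the given time.
    void heartbeat(String nodeId, long nowMillis) {
        lastHeartbeat.put(nodeId, nowMillis);
    }

    // Returns the nodes whose last heartbeat is older than the timeout and
    // forgets them, which plays the role of the REMOVE event in this sketch.
    List<String> expireSessions(long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > sessionTimeoutMillis) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        return removed;
    }
}
```

Because the decision rests only on elapsed time, a node that is merely slow or unreachable is reported exactly like a crashed one, which is why the detector is called unreliable.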

// Detection flow
notify(event):
    if event.type == REMOVE:
        metadata = parseServerFromHeartbeat(event.data)
        onServerRemove(metadata)
        for listener in changeListeners:
            listener.onServerRemove(metadata)
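
The detection flow above might look roughly like this in Java. The types below (EventType, RegistryEvent, ChangeListener, ClusterSubscriber) are simplified, hypothetical stand-ins rather than the real DolphinScheduler classes, and the sketch treats the event payload as a plain node address instead of parsing heartbeat JSON.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only.
enum EventType { ADD, UPDATE, REMOVE }

class RegistryEvent {
    final EventType type;
    final String data; // assumed here to be the node address, not heartbeat JSON

    RegistryEvent(EventType type, String data) {
        this.type = type;
        this.data = data;
    }
}

interface ChangeListener {
    void onServerRemove(String nodeAddress);
}

class ClusterSubscriber {
    private final List<String> removedServers = new ArrayList<>();
    private final List<ChangeListener> changeListeners = new ArrayList<>();

    void registerListener(ChangeListener listener) {
        changeListeners.add(listener);
    }

    // Handles one registry event; only REMOVE triggers the failure path.
    void onEvent(RegistryEvent event) {
        if (event.type != EventType.REMOVE) {
            return;
        }
        String nodeAddress = event.data;
        removedServers.add(nodeAddress);          // update the cluster view first
        for (ChangeListener listener : changeListeners) {
            listener.onServerRemove(nodeAddress); // then fan out to listeners
        }
    }

    List<String> getRemovedServers() {
        return removedServers;
    }
}
```

A failover component would register itself with registerListener and start recovery for the removed node inside onServerRemove.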

Related Pages

Implemented By

Implementation:Apache_Dolphinscheduler_AbstractClusterSubscribeListener_Notify
