Workflow: Apache DolphinScheduler Workflow Failover Recovery
| Knowledge Sources | |
|---|---|
| Domains | High_Availability, Distributed_Systems, Fault_Tolerance |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for detecting node failures and recovering workflow execution in DolphinScheduler's multi-Master, multi-Worker distributed architecture.
Description
This workflow describes how DolphinScheduler handles failures in its distributed cluster, including Master node crashes, Worker node failures, and task execution errors. The system uses a registry-based (ZooKeeper/etcd) cluster management layer where nodes register heartbeats. When a node becomes unavailable, the failover mechanism redistributes its workload to healthy nodes. The process covers failure detection, workflow instance recovery, task reassignment, and state reconciliation. DolphinScheduler supports recovering failed tasks, resuming suspended tasks, and handling complete Master failover scenarios.
Usage
Execute this workflow when you need to understand or configure the high-availability behavior of a DolphinScheduler cluster, troubleshoot failover scenarios, or design deployment architectures that maximize uptime and minimize data pipeline disruption during infrastructure failures.
Execution Steps
Step 1: Cluster Registration and Health Monitoring
All Master and Worker nodes register themselves with the cluster registry upon startup. Each node maintains a periodic heartbeat to indicate liveness. The cluster management layer (MasterClusters, WorkerClusters) tracks active nodes, their resource capacity, and their current workload through slot assignment.
Key considerations:
- Nodes register with host:port identifiers in the registry
- Heartbeat interval and timeout are configurable
- SlotManager assigns execution slots to Masters for fair work distribution
- Worker groups organize Workers by capability (e.g., GPU, high-memory)
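The slot-based distribution described above can be sketched as follows. This is a minimal illustration of the idea, not DolphinScheduler's actual SlotManager implementation: it maps a workflow instance to one of the active Masters by hashing its ID onto a slot, using a stable sort so every node computes the same assignment.

```python
def assign_slot(workflow_instance_id: int, active_masters: list[str]) -> str:
    """Illustrative slot assignment: map a workflow instance to one of the
    active Masters. Each registered Master (host:port) owns a slot; the
    instance ID modulo the slot count selects the owner. The real
    SlotManager policy may differ."""
    if not active_masters:
        raise RuntimeError("no active Masters registered")
    masters = sorted(active_masters)  # stable ordering across all nodes
    slot = workflow_instance_id % len(masters)
    return masters[slot]
```

Because every Master sorts the registry contents the same way, any node can compute which Master owns a given instance without coordination; when membership changes, slots are simply recomputed over the new sorted list.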
Step 2: Detect Node Failure
When a node's heartbeat expires (missed heartbeat threshold exceeded), the cluster manager raises a failure event. The detection mechanism distinguishes between Master failures and Worker failures, as each requires different recovery procedures. Network partitions are handled through registry session semantics.
Key considerations:
- Heartbeat timeout must balance fast detection against the risk of false positives
- Registry session loss triggers immediate failover consideration
- Split-brain scenarios are mitigated by the registry's consistency guarantees
- Failure events are propagated to all remaining cluster members
Step 3: Initiate Failover Process
Upon detecting a Master failure, a surviving Master initiates the WorkflowFailover process. This involves identifying all workflow instances that were being managed by the failed Master, locking them to prevent concurrent recovery, and preparing them for reassignment. The failover command is processed through the same command pipeline as normal workflow operations.
Key considerations:
- Only one surviving Master handles the failover to prevent conflicts
- Workflow instances are identified by their assigned Master host
- In-flight tasks on the failed Master need state reconciliation
- The failover process itself is idempotent to handle partial failures
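The locking and idempotency requirements above can be sketched as follows. The registry and repository interfaces here (`acquire_lock`, `find_instances_by_host`) are hypothetical stand-ins for a distributed lock and the workflow instance store, used only to show the shape of the logic.

```python
class InMemoryRegistry:
    """Stand-in for a registry-backed distributed lock (e.g. ZooKeeper)."""

    def __init__(self):
        self.locks: set[str] = set()

    def acquire_lock(self, path: str) -> bool:
        if path in self.locks:
            return False
        self.locks.add(path)
        return True

    def release_lock(self, path: str) -> None:
        self.locks.discard(path)


class FailoverCoordinator:
    """Sketch of idempotent Master failover: one surviving Master takes a
    per-host lock so only it recovers the failed host's workflow
    instances, and already-recovered instances are skipped on retry."""

    def __init__(self, registry, instances: list[dict]):
        self.registry = registry
        self.instances = instances  # stand-in for the instance repository

    def failover_master(self, failed_host: str) -> list[int]:
        lock_path = f"failover/{failed_host}"
        if not self.registry.acquire_lock(lock_path):
            return []  # another Master already owns this failover
        recovered = []
        try:
            for inst in self.instances:
                if inst["host"] != failed_host:
                    continue
                if inst.get("recovered"):
                    continue  # idempotent: skip on partial-failure retry
                inst["recovered"] = True
                recovered.append(inst["id"])
        finally:
            self.registry.release_lock(lock_path)
        return recovered
```

Running `failover_master` a second time for the same host returns nothing new: the per-instance `recovered` flag makes a retried or partially completed failover safe.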
Step 4: Recover Workflow Instances
Each affected workflow instance is recovered based on its state at the time of failure. Running workflows are resumed from their current task positions. The recovery mechanism uses IWorkflowControlClient's triggerFromFailureTasks and triggerFromSuspendTasks methods to restart execution from the appropriate checkpoint, avoiding re-execution of already completed tasks.
Key considerations:
- Completed tasks are not re-executed; only pending/running tasks are recovered
- Task output parameters from completed tasks are preserved
- Sub-workflow instances follow the same recovery pattern recursively
- Recovery respects the original workflow's failure strategy and priority settings
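The checkpoint selection described above amounts to partitioning tasks by state. A minimal sketch, using simplified state names rather than DolphinScheduler's actual execution-state enum:

```python
# States that require re-execution after failover; completed tasks keep
# their outputs and are never re-run. Names are simplified for illustration.
RECOVERABLE_STATES = {"FAILURE", "KILL", "RUNNING", "SUBMITTED"}


def tasks_to_recover(task_states: dict[str, str]) -> list[str]:
    """Given task name -> state at the time of failure, return the tasks
    that must be restarted. Everything in a terminal success state is
    skipped, which is what avoids re-executing completed work."""
    return [name for name, state in task_states.items()
            if state in RECOVERABLE_STATES]
```

In the real system this selection is driven through `triggerFromFailureTasks` / `triggerFromSuspendTasks`, which also restore the output parameters of completed upstream tasks so recovered tasks see the same inputs as the original run.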
Step 5: Reassign Tasks and Resume Execution
Recovered tasks are re-dispatched to available Workers through the normal task dispatch pipeline. The load balancer considers the updated cluster topology (minus the failed node) when selecting target Workers. Execution resumes with the same parameters and environment as the original run.
Key considerations:
- Worker failure recovery reassigns only the affected tasks, not entire workflows
- Task retry counts are respected during reassignment
- Resource limits and worker group constraints are re-evaluated
- Dispatch events flow through the TaskDispatchableEventBus
Step 6: Reconcile State and Verify Completion
After recovery, the system reconciles the state of all affected workflow and task instances. This includes verifying that all tasks have reached a terminal state, updating workflow instance metadata to reflect the new Master assignment, and generating appropriate alerts for any workflows that could not be recovered.
Key considerations:
- State reconciliation handles edge cases like tasks completed during failover
- Audit trail records the failover event and recovery actions
- Alert notifications inform administrators of failover events
- Integration tests validate common failover scenarios using YAML-defined test cases
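The reconciliation pass can be sketched as a terminal-state check over each affected workflow. State names are simplified, and the returned record is a hypothetical shape for the metadata update and alert decision, not an actual DolphinScheduler structure:

```python
# Simplified terminal task states; anything else means the task did not
# settle during failover and needs attention.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "KILL", "STOP"}


def reconcile(task_states: dict[str, str], new_master: str) -> dict:
    """Post-failover reconciliation sketch: verify every task reached a
    terminal state, reassign the workflow to its new Master, and flag
    the workflow for an alert if anything is still unsettled."""
    stuck = [t for t, s in task_states.items() if s not in TERMINAL_STATES]
    return {
        "host": new_master,        # updated Master assignment metadata
        "needs_alert": bool(stuck),
        "stuck_tasks": stuck,
    }
```

A task that completed on a Worker while failover was in flight shows up here as already terminal, so reconciliation records its result instead of re-dispatching it, which is the edge case called out above.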