Workflow: Apache DolphinScheduler Workflow Failover Recovery
| Knowledge Sources | |
|---|---|
| Domains | High_Availability, Distributed_Systems, Fault_Tolerance |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for detecting node failures and recovering workflow execution in DolphinScheduler's multi-Master, multi-Worker distributed architecture.
Description
This workflow describes how DolphinScheduler handles failures in its distributed cluster, including Master node crashes, Worker node failures, and task execution errors. The system uses a registry-based (ZooKeeper/etcd) cluster management layer where nodes register heartbeats. When a node becomes unavailable, the failover mechanism redistributes its workload to healthy nodes. The process covers failure detection, workflow instance recovery, task reassignment, and state reconciliation. DolphinScheduler supports recovering failed tasks, resuming suspended tasks, and handling complete Master failover scenarios.
Usage
Execute this workflow when you need to understand or configure the high-availability behavior of a DolphinScheduler cluster, troubleshoot failover scenarios, or design deployment architectures that maximize uptime and minimize data pipeline disruption during infrastructure failures.
Execution Steps
Step 1: Cluster Registration and Health Monitoring
All Master and Worker nodes register themselves with the cluster registry upon startup. Each node maintains a periodic heartbeat to indicate liveness. The cluster management layer (MasterClusters, WorkerClusters) tracks active nodes, their resource capacity, and their current workload through slot assignment.
Key considerations:
- Nodes register with host:port identifiers in the registry
- Heartbeat interval and timeout are configurable
- SlotManager assigns execution slots to Masters for fair work distribution
- Worker groups organize Workers by capability (e.g., GPU, high-memory)
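The slot-based distribution described above can be sketched as follows. This is a minimal illustration of the idea, not DolphinScheduler's actual SlotManager implementation: it maps a workflow instance to one of the active Masters by hashing its ID onto a slot, using a stable sort so every node computes the same assignment.

```python
def assign_slot(workflow_instance_id: int, active_masters: list[str]) -> str:
    """Illustrative slot assignment: map a workflow instance to one of the
    active Masters. Each registered Master (host:port) owns a slot; the
    instance ID modulo the slot count selects the owner. The real
    SlotManager policy may differ."""
    if not active_masters:
        raise RuntimeError("no active Masters registered")
    masters = sorted(active_masters)  # stable ordering across all nodes
    slot = workflow_instance_id % len(masters)
    return masters[slot]
```

Because every Master sorts the registry contents the same way, any node can compute which Master owns a given instance without coordination; when membership changes, slots are simply recomputed over the new sorted list.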
Step 2: Detect Node Failure
When a node's heartbeat expires (missed heartbeat threshold exceeded), the cluster manager raises a failure event. The detection mechanism distinguishes between Master failures and Worker failures, as each requires different recovery procedures. Network partitions are handled through registry session semantics.
Key considerations:
- Heartbeat timeout must balance fast detection against the risk of false positives
- Registry session loss triggers immediate failover consideration
- Split-brain scenarios are mitigated by the registry's consistency guarantees
- Failure events are propagated to all remaining cluster members
Step 3: Initiate Failover Process
Upon detecting a Master failure, a surviving Master initiates the WorkflowFailover process. This involves identifying all workflow instances that were being managed by the failed Master, locking them to prevent concurrent recovery, and preparing them for reassignment. The failover command is processed through the same command pipeline as normal workflow operations.
Key considerations:
- Only one surviving Master handles the failover to prevent conflicts
- Workflow instances are identified by their assigned Master host
- In-flight tasks on the failed Master need state reconciliation
- The failover process itself is idempotent to handle partial failures
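The locking and idempotency requirements above can be sketched as follows. The registry and repository interfaces here (`acquire_lock`, `find_instances_by_host`) are hypothetical stand-ins for a distributed lock and the workflow instance store, used only to show the shape of the logic.

```python
class InMemoryRegistry:
    """Stand-in for a registry-backed distributed lock (e.g. ZooKeeper)."""

    def __init__(self):
        self.locks: set[str] = set()

    def acquire_lock(self, path: str) -> bool:
        if path in self.locks:
            return False
        self.locks.add(path)
        return True

    def release_lock(self, path: str) -> None:
        self.locks.discard(path)


class FailoverCoordinator:
    """Sketch of idempotent Master failover: one surviving Master takes a
    per-host lock so only it recovers the failed host's workflow
    instances, and already-recovered instances are skipped on retry."""

    def __init__(self, registry, instances: list[dict]):
        self.registry = registry
        self.instances = instances  # stand-in for the instance repository

    def failover_master(self, failed_host: str) -> list[int]:
        lock_path = f"failover/{failed_host}"
        if not self.registry.acquire_lock(lock_path):
            return []  # another Master already owns this failover
        recovered = []
        try:
            for inst in self.instances:
                if inst["host"] != failed_host:
                    continue
                if inst.get("recovered"):
                    continue  # idempotent: skip on partial-failure retry
                inst["recovered"] = True
                recovered.append(inst["id"])
        finally:
            self.registry.release_lock(lock_path)
        return recovered
```

Running `failover_master` a second time for the same host returns nothing new: the per-instance `recovered` flag makes a retried or partially completed failover safe.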
Step 4: Recover Workflow Instances
Each affected workflow instance is recovered based on its state at the time of failure. Running workflows are resumed from their current task positions. The recovery mechanism uses IWorkflowControlClient's triggerFromFailureTasks and triggerFromSuspendTasks methods to restart execution from the appropriate checkpoint, avoiding re-execution of already completed tasks.
Key considerations:
- Completed tasks are not re-executed; only pending/running tasks are recovered
- Task output parameters from completed tasks are preserved
- Sub-workflow instances follow the same recovery pattern recursively
- Recovery respects the original workflow's failure strategy and priority settings
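The checkpoint selection described above amounts to partitioning tasks by state. A minimal sketch, using simplified state names rather than DolphinScheduler's actual execution-state enum:

```python
# States that require re-execution after failover; completed tasks keep
# their outputs and are never re-run. Names are simplified for illustration.
RECOVERABLE_STATES = {"FAILURE", "KILL", "RUNNING", "SUBMITTED"}


def tasks_to_recover(task_states: dict[str, str]) -> list[str]:
    """Given task name -> state at the time of failure, return the tasks
    that must be restarted. Everything in a terminal success state is
    skipped, which is what avoids re-executing completed work."""
    return [name for name, state in task_states.items()
            if state in RECOVERABLE_STATES]
```

In the real system this selection is driven through `triggerFromFailureTasks` / `triggerFromSuspendTasks`, which also restore the output parameters of completed upstream tasks so recovered tasks see the same inputs as the original run.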
Step 5: Reassign Tasks and Resume Execution
Recovered tasks are re-dispatched to available Workers through the normal task dispatch pipeline. The load balancer considers the updated cluster topology (minus the failed node) when selecting target Workers. Execution resumes with the same parameters and environment as the original run.
Key considerations:
- Worker failure recovery reassigns only the affected tasks, not entire workflows
- Task retry counts are respected during reassignment
- Resource limits and worker group constraints are re-evaluated
- Dispatch events flow through the TaskDispatchableEventBus
Step 6: Reconcile State and Verify Completion
After recovery, the system reconciles the state of all affected workflow and task instances. This includes verifying that all tasks have reached a terminal state, updating workflow instance metadata to reflect the new Master assignment, and generating appropriate alerts for any workflows that could not be recovered.
Key considerations:
- State reconciliation handles edge cases like tasks completed during failover
- Audit trail records the failover event and recovery actions
- Alert notifications inform administrators of failover events
- Integration tests validate common failover scenarios using YAML-defined test cases
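The reconciliation pass can be sketched as a terminal-state check over each affected workflow. State names are simplified, and the returned record is a hypothetical shape for the metadata update and alert decision, not an actual DolphinScheduler structure:

```python
# Simplified terminal task states; anything else means the task did not
# settle during failover and needs attention.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "KILL", "STOP"}


def reconcile(task_states: dict[str, str], new_master: str) -> dict:
    """Post-failover reconciliation sketch: verify every task reached a
    terminal state, reassign the workflow to its new Master, and flag
    the workflow for an alert if anything is still unsettled."""
    stuck = [t for t, s in task_states.items() if s not in TERMINAL_STATES]
    return {
        "host": new_master,        # updated Master assignment metadata
        "needs_alert": bool(stuck),
        "stuck_tasks": stuck,
    }
```

A task that completed on a Worker while failover was in flight shows up here as already terminal, so reconciliation records its result instead of re-dispatching it, which is the edge case called out above.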