Principle:Apache Dolphinscheduler Failover Process Initiation
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Fault_Tolerance |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Failover process initiation is the coordinated step that identifies the workflows and tasks affected by a node failure and creates recovery commands so that execution resumes on healthy nodes.
Description
The Failover Process Initiation principle defines how DolphinScheduler initiates recovery after a node failure. FailoverCoordinator responds to both master and worker failure events. For a master failure, it finds all workflow instances that were running on the crashed master and calls WorkflowFailover.failoverWorkflow() for each one, which inserts a Command of type CommandType.RECOVER_TOLERANCE_FAULT_PROCESS into the command table. For a worker failure, it identifies the tasks that were running on the crashed worker and delegates their recovery to TaskFailover. A registry-based failover marker prevents the same failure from being processed more than once.
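The marker idea can be sketched as an atomic "create if absent" on a per-node key. The sketch below is illustrative only: `InMemoryRegistry`, `FailoverMarkerGuard`, the key layout, and the roll-back-on-failure behavior are all assumptions made for this example, not DolphinScheduler classes or its actual marker protocol.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory stand-in for a cluster registry (not a DolphinScheduler class). */
final class InMemoryRegistry {
    private final ConcurrentMap<String, Boolean> keys = new ConcurrentHashMap<>();

    /** Atomically creates the key; returns true only for the first caller. */
    boolean createIfAbsent(String key) {
        return keys.putIfAbsent(key, Boolean.TRUE) == null;
    }

    void delete(String key) {
        keys.remove(key);
    }
}

/** Hypothetical guard that lets only one caller run failover for a given failed node. */
final class FailoverMarkerGuard {
    private final InMemoryRegistry registry;

    FailoverMarkerGuard(InMemoryRegistry registry) {
        this.registry = registry;
    }

    /**
     * Runs the action if no marker exists yet for the failed node.
     * Returns false when another caller has already claimed the failover.
     */
    boolean runOnce(String failedNodeAddress, Runnable failoverAction) {
        String markerKey = "failover-marker/" + failedNodeAddress; // assumed key layout
        if (!registry.createIfAbsent(markerKey)) {
            return false; // duplicate event: failover already claimed
        }
        try {
            failoverAction.run();
            return true;
        } catch (RuntimeException e) {
            // Assumed policy: drop the marker so a later attempt can retry.
            registry.delete(markerKey);
            throw e;
        }
    }
}
```

With this shape, two masters that observe the same failure event both call `runOnce`, but only the first one executes the failover action; the second call returns `false` without side effects.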
Usage
Failover is automatically triggered by cluster change listeners when a node failure is detected. No manual intervention is required for standard failover scenarios.
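As a rough illustration of listener-driven triggering, the sketch below wires a node-removal notification to a failover handler. `ClusterMembership` and its methods are hypothetical names chosen for this example; the real listener interfaces in DolphinScheduler may look different.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical membership tracker that notifies listeners when a node disappears. */
final class ClusterMembership {
    private final List<Consumer<String>> removalListeners = new CopyOnWriteArrayList<>();

    /** Registers a callback that receives the address of each removed node. */
    void onNodeRemoved(Consumer<String> listener) {
        removalListeners.add(listener);
    }

    /** Assumed to be invoked by a registry watcher when a node's entry vanishes. */
    void nodeRemoved(String nodeAddress) {
        for (Consumer<String> listener : removalListeners) {
            listener.accept(nodeAddress);
        }
    }
}
```

A failover component would register itself once at startup, for example `membership.onNodeRemoved(address -> startFailover(address))`, where `startFailover` is a placeholder for the coordinator entry point. After that, no operator action is needed when a node drops out.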
Theoretical Basis
The failover follows a Command-based Recovery Pattern:
- Detection: Cluster listener notifies FailoverCoordinator of node failure
- Identification: Query database for workflows/tasks on the failed node
- Command Creation: Insert recovery commands with WorkflowFailoverCommandParam
- Execution: CommandEngine picks up recovery commands and re-processes workflows
- Idempotency: Registry markers prevent duplicate failover
failoverMaster(event):
    workflows = findWorkflowsOnMaster(event.masterAddress)
    for workflow in workflows:
        WorkflowFailover.failoverWorkflow(workflow)
        // Creates RECOVER_TOLERANCE_FAULT_PROCESS command

failoverWorker(event):
    tasks = findTasksOnWorker(event.workerAddress)
    for task in tasks:
        TaskFailover.failoverTask(task)
        // Re-dispatches to healthy worker
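
The master branch of the pseudocode above can be made concrete with a small self-contained Java sketch. Everything here is a simplified stand-in: `WorkflowInstance`, `Command`, `MasterFailoverSketch`, and the in-memory lists that play the role of the database and command table are assumptions for illustration. Only the command type name `RECOVER_TOLERANCE_FAULT_PROCESS` is taken from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Simplified stand-in for the command type enum; only the recovery type is modeled. */
enum CommandType { RECOVER_TOLERANCE_FAULT_PROCESS }

/** Hypothetical minimal view of a running workflow instance. */
final class WorkflowInstance {
    final int id;
    final String host; // address of the master that owns this instance

    WorkflowInstance(int id, String host) {
        this.id = id;
        this.host = host;
    }
}

/** Hypothetical recovery command row. */
final class Command {
    final CommandType type;
    final int workflowInstanceId;

    Command(CommandType type, int workflowInstanceId) {
        this.type = type;
        this.workflowInstanceId = workflowInstanceId;
    }
}

final class MasterFailoverSketch {
    private final List<WorkflowInstance> runningInstances;        // stands in for the database
    private final List<Command> commandTable = new ArrayList<>(); // stands in for the command table

    MasterFailoverSketch(List<WorkflowInstance> runningInstances) {
        this.runningInstances = new ArrayList<>(runningInstances);
    }

    /** Identification plus command creation for one failed master. */
    List<Command> failoverMaster(String failedMasterAddress) {
        List<Command> created = new ArrayList<>();
        for (WorkflowInstance instance : runningInstances) {
            if (instance.host.equals(failedMasterAddress)) {
                Command command =
                        new Command(CommandType.RECOVER_TOLERANCE_FAULT_PROCESS, instance.id);
                commandTable.add(command);
                created.add(command);
            }
        }
        return created;
    }

    /** Read-only view of all commands inserted so far. */
    List<Command> commandTable() {
        return Collections.unmodifiableList(commandTable);
    }
}
```

The sketch stops at command insertion on purpose: in the pattern described above, a separate component later picks the commands up and re-processes the workflows, so the failover path itself stays short and does not execute any workflow logic.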