Principle:Apache Dolphinscheduler Failover Process Initiation

From Leeroopedia


Knowledge Sources
Domains: Distributed_Systems, Fault_Tolerance
Last Updated: 2026-02-10 00:00 GMT

Overview

Failover process initiation is the coordinated procedure that, when a node fails, identifies the affected workflows and tasks and creates recovery commands so that execution resumes on healthy nodes.

Description

The Failover Process Initiation principle defines how DolphinScheduler starts recovery after a node failure. FailoverCoordinator responds to both master and worker failure events. For a master failure, it finds every workflow instance that was running on the crashed master and calls WorkflowFailover.failoverWorkflow() for each one; that call inserts a Command of type CommandType.RECOVER_TOLERANCE_FAULT_PROCESS into the command table. For a worker failure, it identifies the tasks that were running on the crashed worker and delegates their recovery to TaskFailover. A registry-based failover marker ensures that the same failure is not processed twice.
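The role of the failover marker can be illustrated with a minimal sketch. This is a language-agnostic model written in Python, not DolphinScheduler's Java code: InMemoryRegistry, put_if_absent, failover_once, and the marker path format are all hypothetical names chosen for illustration, and keying the marker by node address plus startup time is an assumption.

```python
import threading


class InMemoryRegistry:
    """Hypothetical stand-in for the cluster registry; not a real DolphinScheduler class."""

    def __init__(self):
        self._keys = set()
        self._lock = threading.Lock()

    def put_if_absent(self, key: str) -> bool:
        """Atomically create the key. Returns True only for the first caller."""
        with self._lock:
            if key in self._keys:
                return False
            self._keys.add(key)
            return True


def failover_once(registry, node_address, startup_time, do_failover):
    """Run do_failover at most once per (node address, startup time) pair.

    The marker path below is an assumed format used only in this sketch.
    """
    marker = f"/failover-markers/{node_address}-{startup_time}"
    if not registry.put_if_absent(marker):
        return False  # some coordinator already claimed this failure
    do_failover(node_address)
    return True
```

In this sketch the marker is claimed before the failover work runs, which keeps the example short. A production system would more plausibly hold a lock during the work and record completion afterwards, so that a coordinator crashing mid-failover does not leave the failure unhandled.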

Usage

Failover is automatically triggered by cluster change listeners when a node failure is detected. No manual intervention is required for standard failover scenarios.
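As a rough illustration of that automatic trigger, the sketch below models a cluster change listener that routes a node-removal event to the matching failover path. It is a simplified Python model, not the DolphinScheduler API: NodeType, NodeRemovedEvent, ClusterChangeListener, RecordingCoordinator, and their method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    MASTER = "master"
    WORKER = "worker"


@dataclass(frozen=True)
class NodeRemovedEvent:
    """Hypothetical event emitted when a node disappears from the cluster."""
    node_type: NodeType
    address: str


class RecordingCoordinator:
    """Hypothetical coordinator that only records which failover path was taken."""

    def __init__(self):
        self.calls = []

    def failover_master(self, event):
        self.calls.append(("master", event.address))

    def failover_worker(self, event):
        self.calls.append(("worker", event.address))


class ClusterChangeListener:
    """Routes node-removal events to the coordinator without operator input."""

    def __init__(self, coordinator):
        self._coordinator = coordinator

    def on_node_removed(self, event):
        if event.node_type is NodeType.MASTER:
            self._coordinator.failover_master(event)
        elif event.node_type is NodeType.WORKER:
            self._coordinator.failover_worker(event)
        else:
            raise ValueError(f"unsupported node type: {event.node_type}")
```

Usage, under the same assumptions: constructing `ClusterChangeListener(RecordingCoordinator())` and calling `on_node_removed(NodeRemovedEvent(NodeType.MASTER, "m1:5678"))` records a single master failover for that address.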

Theoretical Basis

The failover follows a Command-based Recovery Pattern:

  • Detection: Cluster listener notifies FailoverCoordinator of node failure
  • Identification: Query database for workflows/tasks on the failed node
  • Command Creation: Insert recovery commands with WorkflowFailoverCommandParam
  • Execution: CommandEngine picks up recovery commands and re-processes workflows
  • Idempotency: Registry markers prevent duplicate failover

In pseudocode, the two entry points are:

failoverMaster(event):
    workflows = findWorkflowsOnMaster(event.masterAddress)
    for workflow in workflows:
        // Creates a RECOVER_TOLERANCE_FAULT_PROCESS command
        WorkflowFailover.failoverWorkflow(workflow)

failoverWorker(event):
    tasks = findTasksOnWorker(event.workerAddress)
    for task in tasks:
        // Re-dispatches to a healthy worker
        TaskFailover.failoverTask(task)
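
The identification, command creation, and execution steps of the pattern can be modelled end to end with an in-memory sketch. This is a simplified Python model under stated assumptions, not the actual implementation: the Scheduler class, its method names, the list used as a command table, and the "FAILOVER" state label are hypothetical. Only the command type name comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Command type name taken from the description; everything else is illustrative.
RECOVER = "RECOVER_TOLERANCE_FAULT_PROCESS"


@dataclass
class WorkflowInstance:
    id: int
    host: str
    state: str = "RUNNING"


@dataclass
class Command:
    command_type: str
    workflow_instance_id: int


@dataclass
class Scheduler:
    """Hypothetical in-memory model of the workflow table and command table."""
    workflows: Dict[int, WorkflowInstance] = field(default_factory=dict)
    command_table: List[Command] = field(default_factory=list)

    def failover_master(self, failed_address: str) -> int:
        # Identification: running workflows hosted on the failed master.
        affected = [w for w in self.workflows.values()
                    if w.host == failed_address and w.state == "RUNNING"]
        for wf in affected:
            # Command creation: one recovery command per affected workflow.
            wf.state = "FAILOVER"  # assumed state label for this sketch
            self.command_table.append(Command(RECOVER, wf.id))
        return len(affected)

    def consume_commands(self, healthy_address: str) -> List[int]:
        # Execution: a healthy master drains the command table in order.
        resumed = []
        while self.command_table:
            cmd = self.command_table.pop(0)
            wf = self.workflows[cmd.workflow_instance_id]
            wf.host = healthy_address
            wf.state = "RUNNING"
            resumed.append(wf.id)
        return resumed
```

Because affected workflows leave the RUNNING state as soon as their command is created, a second call to `failover_master` for the same address finds nothing to do in this model. That complements, rather than replaces, the registry marker described above.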

Related Pages

Implemented By
