Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:ArroyoSystems Arroyo Failure Detection

From Leeroopedia


Template:Principle

Failure Detection

Principle: Detecting worker and task failures in a distributed streaming system. Uses heartbeat-based liveness monitoring and task status tracking to identify failures and trigger recovery.

Theoretical Basis

Failure detection in distributed systems is a fundamental problem studied extensively in the distributed computing literature. Arroyo employs a dual-mode failure detection strategy combining heartbeat-based liveness monitoring with event-driven task failure reporting.

Heartbeat-Based Liveness Monitoring

Each worker periodically sends heartbeat messages to the controller, serving as proof of liveness:

  • Protocol: Workers send heartbeats at a regular interval (e.g., every few seconds).
  • Timeout: The controller declares a worker failed if no heartbeat is received within a configurable timeout period.
  • Failure mode detected: Crash failures -- the worker process has terminated, the network is partitioned, or the worker is unresponsive.

The heartbeat protocol provides an eventually perfect failure detector (class Diamond-P in the Chandra-Toueg hierarchy):

  • Completeness: Every crashed worker is eventually detected (after the timeout period).
  • Accuracy: A live worker is not falsely suspected (given a sufficiently long timeout relative to network delays).

Task Failure Detection (Event-Driven)

In addition to heartbeat monitoring, task-level failures are reported immediately by workers:

  • Protocol: When a task encounters a fatal error (e.g., operator panic, serialization failure, resource exhaustion), the worker reports the failure event to the controller.
  • Failure mode detected: Logical errors -- the worker process is alive, but a specific task within it has failed.

Combining Both Mechanisms

The dual-mode approach is necessary because neither mechanism alone is sufficient:

Mechanism Detects Latency Limitation
Heartbeat timeout Crash failures, network partitions Timeout period (seconds) Cannot detect logical task failures
Task failure events Operator errors, resource exhaustion Immediate Cannot detect process crashes or network partitions

Recovery Trigger

Upon detecting a failure (from either mechanism), the controller initiates recovery:

  1. The failed job transitions to a recovery state.
  2. The controller identifies the last successful checkpoint.
  3. Workers are rescheduled and state is restored from the checkpoint.

Domains

  • Stream_Processing: Failure detection is essential for maintaining continuous stream processing in the presence of faults.
  • Distributed_Systems: Heartbeat-based failure detection is a core distributed systems primitive.
  • Fault_Tolerance: Timely failure detection is a prerequisite for fast recovery.

Related Implementation

Implementation:ArroyoSystems_Arroyo_Failure_Detector Heuristic:ArroyoSystems_Arroyo_Worker_Heartbeat_Timeout

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment