Principle:ArroyoSystems Arroyo Failure Detection

Failure Detection

Principle: Detecting worker and task failures in a distributed streaming system. Uses heartbeat-based liveness monitoring and task status tracking to identify failures and trigger recovery.

Theoretical Basis

Failure detection in distributed systems is a fundamental problem studied extensively in the distributed computing literature. Arroyo employs a dual-mode failure detection strategy combining heartbeat-based liveness monitoring with event-driven task failure reporting.

Heartbeat-Based Liveness Monitoring

Each worker periodically sends heartbeat messages to the controller, serving as proof of liveness:

Protocol: Workers send heartbeats at a regular interval (e.g., every few seconds).
Timeout: The controller declares a worker failed if no heartbeat is received within a configurable timeout period.
Failure mode detected: Crash failures -- the worker process has terminated, the network is partitioned, or the worker is unresponsive.

The heartbeat protocol provides an eventually perfect failure detector (class Diamond-P in the Chandra-Toueg hierarchy):

Completeness: Every crashed worker is eventually detected (after the timeout period).
Accuracy: A live worker is not falsely suspected (given a sufficiently long timeout relative to network delays).

Task Failure Detection (Event-Driven)

In addition to heartbeat monitoring, task-level failures are reported immediately by workers:

Protocol: When a task encounters a fatal error (e.g., operator panic, serialization failure, resource exhaustion), the worker reports the failure event to the controller.
Failure mode detected: Logical errors -- the worker process is alive, but a specific task within it has failed.

Combining Both Mechanisms

The dual-mode approach is necessary because neither mechanism alone is sufficient:

Mechanism	Detects	Latency	Limitation
Heartbeat timeout	Crash failures, network partitions	Timeout period (seconds)	Cannot detect logical task failures
Task failure events	Operator errors, resource exhaustion	Immediate	Cannot detect process crashes or network partitions

Recovery Trigger

Upon detecting a failure (from either mechanism), the controller initiates recovery:

The failed job transitions to a recovery state.
The controller identifies the last successful checkpoint.
Workers are rescheduled and state is restored from the checkpoint.

Domains

Stream_Processing: Failure detection is essential for maintaining continuous stream processing in the presence of faults.
Distributed_Systems: Heartbeat-based failure detection is a core distributed systems primitive.
Fault_Tolerance: Timely failure detection is a prerequisite for fast recovery.

Related Implementation

Implementation:ArroyoSystems_Arroyo_Failure_Detector Heuristic:ArroyoSystems_Arroyo_Worker_Heartbeat_Timeout

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment