Principle:ArroyoSystems Arroyo Failure Detection
Failure Detection
Principle: Detecting worker and task failures in a distributed streaming system. Uses heartbeat-based liveness monitoring and task status tracking to identify failures and trigger recovery.
Theoretical Basis
Failure detection in distributed systems is a fundamental problem studied extensively in the distributed computing literature. Arroyo employs a dual-mode failure detection strategy combining heartbeat-based liveness monitoring with event-driven task failure reporting.
Heartbeat-Based Liveness Monitoring
Each worker periodically sends heartbeat messages to the controller, serving as proof of liveness:
- Protocol: Workers send heartbeats at a regular interval (e.g., every few seconds).
- Timeout: The controller declares a worker failed if no heartbeat is received within a configurable timeout period.
- Failure mode detected: Crash failures -- the worker process has terminated, the network is partitioned, or the worker is unresponsive.
The heartbeat protocol provides an eventually perfect failure detector (class Diamond-P in the Chandra-Toueg hierarchy):
- Completeness: Every crashed worker is eventually detected (after the timeout period).
- Accuracy: A live worker is not falsely suspected (given a sufficiently long timeout relative to network delays).
Task Failure Detection (Event-Driven)
In addition to heartbeat monitoring, task-level failures are reported immediately by workers:
- Protocol: When a task encounters a fatal error (e.g., operator panic, serialization failure, resource exhaustion), the worker reports the failure event to the controller.
- Failure mode detected: Logical errors -- the worker process is alive, but a specific task within it has failed.
Combining Both Mechanisms
The dual-mode approach is necessary because neither mechanism alone is sufficient:
| Mechanism | Detects | Latency | Limitation |
|---|---|---|---|
| Heartbeat timeout | Crash failures, network partitions | Timeout period (seconds) | Cannot detect logical task failures |
| Task failure events | Operator errors, resource exhaustion | Immediate | Cannot detect process crashes or network partitions |
Recovery Trigger
Upon detecting a failure (from either mechanism), the controller initiates recovery:
- The failed job transitions to a recovery state.
- The controller identifies the last successful checkpoint.
- Workers are rescheduled and state is restored from the checkpoint.
Domains
- Stream_Processing: Failure detection is essential for maintaining continuous stream processing in the presence of faults.
- Distributed_Systems: Heartbeat-based failure detection is a core distributed systems primitive.
- Fault_Tolerance: Timely failure detection is a prerequisite for fast recovery.
Related Implementation
Implementation:ArroyoSystems_Arroyo_Failure_Detector Heuristic:ArroyoSystems_Arroyo_Worker_Heartbeat_Timeout