Implementation:ArroyoSystems Arroyo Failure Detector
Failure Detector Implementation
Implementation of RunningJobModel::worker_timedout and task_failed.
These two methods on RunningJobModel implement the dual-mode failure detection strategy: heartbeat-based crash detection and event-driven task failure detection.
Code Reference
- File:
crates/arroyo-controller/src/job_controller/mod.rs(L553-L578)
Core Interface
impl RunningJobModel {
fn worker_timedout(&self) -> bool {
// Checks if any running worker's last_heartbeat exceeds timeout
}
fn task_failed(&self) -> Option<TaskFailedEvent> {
// Returns Some(event) if any task is in Failed state
}
}
Detailed Behavior
worker_timedout
Checks whether any running worker has exceeded the heartbeat timeout.
| Aspect | Detail |
|---|---|
| Return type | bool
|
Returns true |
If any worker's last_heartbeat timestamp is older than the configured timeout threshold
|
Returns false |
If all workers have sent a heartbeat within the timeout window |
Algorithm:
- Iterate over all registered workers in the
RunningJobModel. - For each worker that is in a running state, compute the elapsed time since its
last_heartbeat. - If any worker's elapsed time exceeds the configured heartbeat timeout, return
true. - If all workers are within the timeout window, return
false.
Design considerations:
- Only workers in running state are checked. Workers that are starting up or shutting down are excluded.
- The timeout threshold must be set conservatively enough to avoid false positives due to transient network delays or GC pauses, while being short enough to enable timely recovery.
task_failed
Checks whether any task within the job has entered a failed state.
| Aspect | Detail |
|---|---|
| Return type | Option<TaskFailedEvent>
|
Returns Some(event) |
If any task is in the Failed state, with details about the failure
|
Returns None |
If all tasks are healthy |
Algorithm:
- Iterate over all tasks across all workers.
- If any task has a status of
Failed, construct aTaskFailedEventcontaining the operator ID, subtask index, and error details. - Return the first failure found (if any).
The TaskFailedEvent includes:
- The operator ID of the failed task.
- The subtask index within the operator.
- The error message or panic information.
- The worker ID that reported the failure.
Usage in the Controller Loop
Both methods are called in the controller's main progress loop to detect failures early:
| Check | Frequency | Action on Detection |
|---|---|---|
worker_timedout() |
Every progress tick | Transition job to recovery state |
task_failed() |
Every progress tick | Transition job to recovery state with failure details |
When either method signals a failure, the controller initiates the recovery process: identifying the last successful checkpoint, rescheduling workers, and restoring state.
I/O
worker_timedout: Returnsbool(trueif any worker heartbeat has expired)task_failed: ReturnsOption<TaskFailedEvent>with failure details if any task is in a failed state
Implements
Principle:ArroyoSystems_Arroyo_Failure_Detection Heuristic:ArroyoSystems_Arroyo_Worker_Heartbeat_Timeout