Implementation:ArroyoSystems Arroyo Failure Detector

Failure Detector Implementation

Implementation of `RunningJobModel::worker_timedout` and `RunningJobModel::task_failed`.

These two methods on RunningJobModel implement the dual-mode failure detection strategy: heartbeat-based crash detection and event-driven task failure detection.

Code Reference

File: crates/arroyo-controller/src/job_controller/mod.rs (L553-L578)

Core Interface

impl RunningJobModel {
    fn worker_timedout(&self) -> bool {
        // Checks if any running worker's last_heartbeat exceeds timeout
    }

    fn task_failed(&self) -> Option<TaskFailedEvent> {
        // Returns Some(event) if any task is in Failed state
    }
}

Detailed Behavior

`worker_timedout`

Checks whether any running worker has exceeded the heartbeat timeout.

Aspect	Detail
Return type	`bool`
Returns `true`	If any worker's `last_heartbeat` timestamp is older than the configured timeout threshold
Returns `false`	If all workers have sent a heartbeat within the timeout window

Algorithm:

Iterate over all registered workers in the RunningJobModel.
For each worker that is in a running state, compute the elapsed time since its last_heartbeat.
If any worker's elapsed time exceeds the configured heartbeat timeout, return true.
If all workers are within the timeout window, return false.

Design considerations:

Only workers in the running state are checked. Workers that are starting up or shutting down are excluded.
The timeout threshold must be long enough to avoid false positives from transient network delays or GC pauses, yet short enough to allow timely recovery.
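The algorithm and the running-state filter above can be sketched as follows. This is a minimal illustration, not the code from the referenced file: the `WorkerState` enum, the `WorkerStatus` struct, the `workers` and `heartbeat_timeout` fields, and the extra `now` parameter (added so the check can be exercised deterministically) are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical worker lifecycle states; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkerState {
    Starting,
    Running,
    Stopping,
}

/// Hypothetical per-worker bookkeeping kept by the controller.
struct WorkerStatus {
    state: WorkerState,
    last_heartbeat: Instant,
}

/// Simplified stand-in for RunningJobModel (field names are assumptions).
struct RunningJobModel {
    workers: HashMap<u64, WorkerStatus>,
    heartbeat_timeout: Duration,
}

impl RunningJobModel {
    /// Returns true if any *running* worker's last heartbeat is older
    /// than the configured timeout, measured against `now`.
    fn worker_timedout(&self, now: Instant) -> bool {
        self.workers.values().any(|w| {
            w.state == WorkerState::Running
                && now.saturating_duration_since(w.last_heartbeat) > self.heartbeat_timeout
        })
    }
}
```

In this sketch a stale heartbeat from a worker that is still starting or already stopping does not trigger detection, which mirrors the exclusion described above.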

`task_failed`

Checks whether any task within the job has entered a failed state.

Aspect	Detail
Return type	`Option<TaskFailedEvent>`
Returns `Some(event)`	If any task is in the `Failed` state, with details about the failure
Returns `None`	If all tasks are healthy

Algorithm:

Iterate over all tasks across all workers.
If any task has a status of Failed, construct a TaskFailedEvent containing the operator ID, subtask index, error details, and the ID of the reporting worker.
Return the first failure found (if any).

The TaskFailedEvent includes:

The operator ID of the failed task.
The subtask index within the operator.
The error message or panic information.
The worker ID that reported the failure.
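A first-failure scan of this kind might look like the sketch below. The `TaskState`, `TaskStatus`, and `WorkerTasks` types and all field names are hypothetical stand-ins chosen to match the four items listed above; the actual `TaskFailedEvent` definition in Arroyo may be shaped differently.

```rust
/// Hypothetical task states; only `Failed` carries error details.
#[derive(Debug, Clone, PartialEq, Eq)]
enum TaskState {
    Running,
    Finished,
    Failed { error: String },
}

/// Hypothetical per-task record.
struct TaskStatus {
    operator_id: String,
    subtask_index: u32,
    state: TaskState,
}

/// Hypothetical grouping of tasks by the worker that runs them.
struct WorkerTasks {
    worker_id: u64,
    tasks: Vec<TaskStatus>,
}

/// Assumed shape of the event, following the fields listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TaskFailedEvent {
    operator_id: String,
    subtask_index: u32,
    error: String,
    worker_id: u64,
}

/// Simplified stand-in for RunningJobModel.
struct RunningJobModel {
    workers: Vec<WorkerTasks>,
}

impl RunningJobModel {
    /// Scans every task on every worker and returns the first failure found.
    fn task_failed(&self) -> Option<TaskFailedEvent> {
        self.workers.iter().find_map(|worker| {
            worker.tasks.iter().find_map(|task| match &task.state {
                TaskState::Failed { error } => Some(TaskFailedEvent {
                    operator_id: task.operator_id.clone(),
                    subtask_index: task.subtask_index,
                    error: error.clone(),
                    worker_id: worker.worker_id,
                }),
                _ => None,
            })
        })
    }
}
```

Because `find_map` stops at the first match, later failures are not reported in the same call; in this sketch they would surface on a subsequent check if the job were still running.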

Usage in the Controller Loop

Both methods are called in the controller's main progress loop to detect failures early:

Check	Frequency	Action on Detection
`worker_timedout()`	Every progress tick	Transition job to recovery state
`task_failed()`	Every progress tick	Transition job to recovery state with failure details

When either method signals a failure, the controller initiates the recovery process: identifying the last successful checkpoint, rescheduling workers, and restoring state.
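The decision made on each progress tick can be illustrated with a small sketch. The `JobState` and `FailureCause` enums and the `next_state` function are hypothetical names introduced for this example, and the order in which the two signals are examined (task failure first, so its details are preserved) is an assumption rather than documented behavior.

```rust
/// Assumed, reduced shape of the failure event for this example.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TaskFailedEvent {
    operator_id: String,
    subtask_index: u32,
    error: String,
}

/// Hypothetical reason for entering recovery.
#[derive(Debug, Clone, PartialEq, Eq)]
enum FailureCause {
    WorkerTimeout,
    TaskFailed(TaskFailedEvent),
}

/// Hypothetical job states relevant to failure detection.
#[derive(Debug, Clone, PartialEq, Eq)]
enum JobState {
    Running,
    Recovering { cause: FailureCause },
}

/// Combines the results of `worker_timedout()` and `task_failed()`
/// for one progress tick into the next job state.
fn next_state(worker_timedout: bool, task_failed: Option<TaskFailedEvent>) -> JobState {
    // Assumption: prefer the task failure because it carries more detail.
    if let Some(event) = task_failed {
        return JobState::Recovering { cause: FailureCause::TaskFailed(event) };
    }
    if worker_timedout {
        return JobState::Recovering { cause: FailureCause::WorkerTimeout };
    }
    JobState::Running
}
```

A caller would evaluate both detectors once per tick and pass the results in, for example `next_state(model.worker_timedout(), model.task_failed())`, then start recovery whenever the result is `Recovering`.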

I/O

worker_timedout: Returns bool (true if any worker heartbeat has expired)
task_failed: Returns Option<TaskFailedEvent> with failure details if any task is in a failed state

Implements

Principle:ArroyoSystems_Arroyo_Failure_Detection
Heuristic:ArroyoSystems_Arroyo_Worker_Heartbeat_Timeout
