Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:ArroyoSystems Arroyo Failure Detector

From Leeroopedia


Template:Implementation

Failure Detector Implementation

Implementation of RunningJobModel::worker_timedout and task_failed.

These two methods on RunningJobModel implement the dual-mode failure detection strategy: heartbeat-based crash detection and event-driven task failure detection.

Code Reference

  • File: crates/arroyo-controller/src/job_controller/mod.rs (L553-L578)

Core Interface

impl RunningJobModel {
    fn worker_timedout(&self) -> bool {
        // Checks if any running worker's last_heartbeat exceeds timeout
    }

    fn task_failed(&self) -> Option<TaskFailedEvent> {
        // Returns Some(event) if any task is in Failed state
    }
}

Detailed Behavior

worker_timedout

Checks whether any running worker has exceeded the heartbeat timeout.

Aspect Detail
Return type bool
Returns true If any worker's last_heartbeat timestamp is older than the configured timeout threshold
Returns false If all workers have sent a heartbeat within the timeout window

Algorithm:

  1. Iterate over all registered workers in the RunningJobModel.
  2. For each worker that is in a running state, compute the elapsed time since its last_heartbeat.
  3. If any worker's elapsed time exceeds the configured heartbeat timeout, return true.
  4. If all workers are within the timeout window, return false.

Design considerations:

  • Only workers in running state are checked. Workers that are starting up or shutting down are excluded.
  • The timeout threshold must be set conservatively enough to avoid false positives due to transient network delays or GC pauses, while being short enough to enable timely recovery.

task_failed

Checks whether any task within the job has entered a failed state.

Aspect Detail
Return type Option<TaskFailedEvent>
Returns Some(event) If any task is in the Failed state, with details about the failure
Returns None If all tasks are healthy

Algorithm:

  1. Iterate over all tasks across all workers.
  2. If any task has a status of Failed, construct a TaskFailedEvent containing the operator ID, subtask index, and error details.
  3. Return the first failure found (if any).

The TaskFailedEvent includes:

  • The operator ID of the failed task.
  • The subtask index within the operator.
  • The error message or panic information.
  • The worker ID that reported the failure.

Usage in the Controller Loop

Both methods are called in the controller's main progress loop to detect failures early:

Check Frequency Action on Detection
worker_timedout() Every progress tick Transition job to recovery state
task_failed() Every progress tick Transition job to recovery state with failure details

When either method signals a failure, the controller initiates the recovery process: identifying the last successful checkpoint, rescheduling workers, and restoring state.

I/O

  • worker_timedout: Returns bool (true if any worker heartbeat has expired)
  • task_failed: Returns Option<TaskFailedEvent> with failure details if any task is in a failed state

Implements

Principle:ArroyoSystems_Arroyo_Failure_Detection Heuristic:ArroyoSystems_Arroyo_Worker_Heartbeat_Timeout

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment