
Heuristic:ArroyoSystems Arroyo Worker Heartbeat Timeout

From Leeroopedia




Knowledge Sources
Domains Reliability, Stream_Processing
Last Updated 2026-02-08 08:00 GMT

Overview

Failure detection heuristic: a 5-second heartbeat interval with a 30-second timeout tolerates up to six consecutive missed heartbeats, balanced against a 10-minute worker startup and a 2-minute task startup grace period.

Description

Arroyo uses heartbeat-based failure detection between workers and the controller. Workers send heartbeats every 5 seconds via gRPC. The controller marks a worker as failed if no heartbeat is received within the configured timeout (default 30 seconds). Separate startup timeouts allow for slow initialization (worker startup: 10 minutes for compilation/loading, task startup: 2 minutes for state restoration). After 20 consecutive restarts without achieving 2 minutes of healthy operation, the pipeline permanently fails.
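As a quick sanity check on these numbers, the missed-heartbeat budget follows directly from the two defaults quoted above (a sketch; `missed_heartbeat_budget` is an illustrative helper, not an Arroyo API):

```rust
use std::time::Duration;

// Defaults quoted above: 5 s heartbeat interval, 30 s timeout.
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);
const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(30);

/// Consecutive heartbeats that can be missed before the controller
/// declares the worker failed.
fn missed_heartbeat_budget(interval: Duration, timeout: Duration) -> u64 {
    timeout.as_secs() / interval.as_secs()
}

fn main() {
    let budget = missed_heartbeat_budget(HEARTBEAT_INTERVAL, HEARTBEAT_TIMEOUT);
    println!("tolerated missed heartbeats: {budget}"); // prints 6
}
```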

Usage

Apply this heuristic when configuring failure detection sensitivity. Reduce the timeout for faster failure detection in stable environments. Increase it if workers experience temporary GC pauses or network hiccups. Adjust startup timeouts based on pipeline state size (larger state = longer recovery).

The Insight (Rule of Thumb)

  • Action: Configure `pipeline.worker-heartbeat-timeout`, `pipeline.worker-startup-time`, `pipeline.task-startup-time`, `pipeline.allowed-restarts`, and `pipeline.healthy-duration`.
  • Value: Defaults: heartbeat timeout = 30s, worker startup = 10m, task startup = 2m, allowed restarts = 20, healthy duration = 2m.
  • Trade-off: Shorter timeout = faster detection + more false positives. Longer timeout = fewer false positives + slower recovery start.
  • Restart policy: After `healthy-duration` (2m) of continuous operation, the restart counter resets. Max 20 restarts before permanent failure.
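A hedged sketch of overriding these defaults in a config file. The key names come from the list above; the `[pipeline]` table is an assumption inferred from the `pipeline.`-prefixed option names, and the tuned values are illustrative only:

```toml
[pipeline]
worker-heartbeat-timeout = "45s"  # tolerate longer GC pauses (default: 30s)
worker-startup-time = "15m"       # slow UDF compilation / artifact downloads (default: 10m)
task-startup-time = "5m"          # large checkpointed state (default: 2m)
allowed-restarts = 20
healthy-duration = "2m"
```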

Reasoning

The 30-second timeout with 5-second heartbeat interval means 6 consecutive missed heartbeats trigger failure detection. This is generous enough to tolerate transient network issues but fast enough to detect genuine failures. The separate startup timeouts are critical because:

  • Worker startup (10m): Workers may need to compile UDFs, download artifacts, or initialize heavy libraries. The 10-minute default accounts for cold starts.
  • Task startup (2m): Tasks restore state from checkpoints. Large state tables may take time to deserialize from Parquet and rebuild in memory.
  • Healthy duration (2m): Prevents restart loops where a worker crashes shortly after start. Only after 2 minutes of healthy operation does the restart counter reset.
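The restart policy described in these bullets can be sketched as a small counter (a hypothetical `RestartPolicy` helper; the real logic lives in Arroyo's controller state machine):

```rust
use std::time::Duration;

const ALLOWED_RESTARTS: u32 = 20;
const HEALTHY_DURATION: Duration = Duration::from_secs(120); // healthy-duration = "2m"

struct RestartPolicy {
    consecutive_restarts: u32,
}

impl RestartPolicy {
    fn new() -> Self {
        Self { consecutive_restarts: 0 }
    }

    /// Called when a worker crashes after running for `uptime`.
    /// Returns true if the pipeline may restart, false if it must
    /// permanently fail.
    fn on_crash(&mut self, uptime: Duration) -> bool {
        if uptime >= HEALTHY_DURATION {
            // 2 minutes of healthy operation resets the counter.
            self.consecutive_restarts = 0;
        }
        self.consecutive_restarts += 1;
        self.consecutive_restarts <= ALLOWED_RESTARTS
    }
}

fn main() {
    let mut policy = RestartPolicy::new();
    // 20 quick crashes are tolerated; the 21st is fatal.
    for _ in 0..ALLOWED_RESTARTS {
        assert!(policy.on_crash(Duration::from_secs(10)));
    }
    assert!(!policy.on_crash(Duration::from_secs(10)));
}
```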

The state backoff (initial 500ms, max 1m) applies when the controller retries state machine transitions, preventing thundering herd effects during recovery.
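A sketch of that backoff schedule. Note the doubling factor is an assumption for illustration: the source only fixes the initial (500ms) and maximum (1m) values, not the growth rule:

```rust
use std::time::Duration;

// Values from default.toml; the doubling between them is assumed.
const STATE_INITIAL_BACKOFF: Duration = Duration::from_millis(500);
const STATE_MAX_BACKOFF: Duration = Duration::from_secs(60);

/// Backoff before the n-th retry (0-indexed), assuming the delay
/// doubles each attempt until it hits the configured maximum.
fn state_backoff(attempt: u32) -> Duration {
    STATE_INITIAL_BACKOFF
        .saturating_mul(1u32 << attempt.min(30))
        .min(STATE_MAX_BACKOFF)
}

fn main() {
    // 500ms, 1s, 2s, 4s, 8s, 16s, 32s, then capped at 60s.
    for attempt in 0..8 {
        println!("retry {attempt}: {:?}", state_backoff(attempt));
    }
}
```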

Network connection retry uses a linearly increasing backoff with jitter: the sleep before attempt `i` is `(i + 1) * (50ms + random 1-49ms)`, with at most 10 attempts, preventing synchronized reconnection storms.

Code Evidence

Default timeouts from `default.toml:8-14`:

allowed-restarts = 20
worker-heartbeat-timeout = "30s"
healthy-duration = "2m"
worker-startup-time = "10m"
task-startup-time = "2m"
state-initial-backoff = "500ms"
state-max-backoff = "1m"

Worker heartbeat loop from `lib.rs:311`:

// Send a heartbeat every 5 seconds; Skip means a delayed tick is
// dropped rather than fired in a catch-up burst.
let mut tick = tokio::time::interval(Duration::from_secs(5));
tick.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

Network retry with jitter from `network_manager.rs:295-328`:

// Up to 10 attempts; the sleep grows linearly with the attempt
// number, plus 1-49 ms of random jitter to de-synchronize reconnects.
for i in 0..10 {
    match TcpStream::connect(&dest).await {
        Ok(tcp_stream) => { return ... }
        Err(e) => {
            warn!("Failed to connect to {dest}: {:?}", e);
            tokio::time::sleep(Duration::from_millis(
                (i + 1) * (50 + rand.random_range(1..50)),
            )).await;
        }
    }
}
panic!("failed to connect to {dest}");
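Summing the sleeps in the snippet above gives the worst-case total wait before the panic, a back-of-the-envelope check:

```rust
fn main() {
    // Sleep before retry i (0-indexed) is (i + 1) * (50 + jitter) ms,
    // with jitter drawn from 1..50 (i.e. 1-49 ms), over 10 attempts.
    let min_total: u64 = (1..=10u64).map(|i| i * (50 + 1)).sum();
    let max_total: u64 = (1..=10u64).map(|i| i * (50 + 49)).sum();
    println!("worst-case total backoff: {min_total}-{max_total} ms"); // 2805-5445 ms
}
```

So a worker gives up on an unreachable peer after roughly 3 to 5.5 seconds of retrying, well inside the 30-second heartbeat timeout.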
