Principle:ArroyoSystems Arroyo Pipeline State Monitoring

Summary

Pipeline State Monitoring is the principle of polling pipeline state to track job lifecycle transitions in the Arroyo streaming engine. The CLI polls the API at a fixed interval for job state changes, enabling it to wait for a pipeline to reach a target state (such as Running, Stopped, or Failed) and react accordingly. This provides a simple, robust mechanism for coordinating between the CLI and the asynchronous pipeline execution engine.

Theoretical Basis

State polling is a simple but robust mechanism for tracking asynchronous operations. In distributed systems, there are two primary approaches for monitoring the state of long-running operations:

Push-based (event-driven): The server notifies the client of state changes via WebSockets, server-sent events, or callback mechanisms.
Pull-based (polling): The client periodically queries the server for the current state.

Arroyo's CLI uses the pull-based approach for pipeline state monitoring. While push-based approaches can offer lower latency notifications, polling provides several advantages in this context:

Simplicity

Polling requires no additional infrastructure beyond the existing REST API. There is no need for WebSocket connections, event queues, or callback registration. The client simply calls the same get_pipeline_jobs API endpoint at regular intervals.

Robustness

Polling is inherently resilient to transient network issues. If a single poll fails, the next one will succeed and provide the correct state. There is no connection state to maintain or reconnect logic to implement. The retry logic built into the get_state function (with exponential backoff) further improves robustness.

Sufficient Responsiveness

By polling at a fixed interval of 100ms, the client achieves near-real-time state tracking. The overhead is negligible for a local cluster setup, and the 100ms latency is imperceptible to the user waiting for a pipeline to start or stop.

Polling Pattern

The state monitoring follows this pattern:

Initial State Query -- Get the current state of the pipeline's job.
Comparison Loop -- Repeatedly query the state at 100ms intervals.
State Change Detection -- When the state differs from the last observed state, log the transition.
Target Match -- If the current state matches any of the expected target states, return success.
Failure Detection -- If the current state is "Failed", return an error immediately regardless of the target states.

This pattern supports waiting for multiple possible target states. For example, the shutdown handler waits for either "Stopped" or "Failed", accepting either as a terminal condition.

State Transitions

A pipeline job in Arroyo progresses through a series of states:

State	Description	Terminal?
Created	Pipeline has been created but not yet scheduled	No
Scheduling	Controller is assigning workers and preparing execution	No
Running	Pipeline is actively processing data	No (target for submission)
Stopping	Pipeline is performing a final checkpoint before stopping	No
Stopped	Pipeline has stopped cleanly with a checkpoint	Yes (for shutdown)
Failed	Pipeline encountered an unrecoverable error	Yes (always error)

Retry Semantics

The state retrieval function includes built-in retry logic for API failures:

Up to 10 retries per state query
Initial backoff of 100ms with maximum backoff of 2 seconds
Warnings logged on each retry attempt

This means a transient API failure (e.g., the API server briefly overloaded during startup) does not cause the monitoring to fail. Only persistent API failures result in a panic.

Usage Contexts

State monitoring is used in two contexts within the local cluster:

Context	Target States	Purpose
Pipeline submission	`["Running"]`	Wait for the pipeline to start before printing the dashboard URL
Graceful shutdown	`["Stopped", "Failed"]`	Wait for the pipeline to complete its final checkpoint before exiting

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment