Workflow:ArroyoSystems Arroyo SQL Pipeline Lifecycle

From Leeroopedia


Knowledge Sources

  • Domains: Stream_Processing, SQL, Data_Engineering
  • Last Updated: 2026-02-08 09:00 GMT

Overview

End-to-end process for creating, compiling, scheduling, and running a SQL streaming pipeline in Arroyo, from query submission through steady-state execution with periodic checkpointing to graceful shutdown.

Description

This workflow covers the complete lifecycle of a streaming SQL pipeline within the Arroyo platform. A user submits a SQL query referencing previously-configured connection tables (sources and sinks). The system compiles the SQL into a logical dataflow program using DataFusion with streaming extensions, persists the pipeline definition to the database, and creates a job. The controller state machine then drives the job through compilation, worker scheduling, task assignment, and execution. During steady-state operation, the engine periodically checkpoints operator state to object storage for fault tolerance. The pipeline runs until the user requests a stop or a failure triggers recovery.

Key aspects:

  • SQL is compiled into a logical dataflow graph with streaming-specific extensions (watermarks, windows, updating aggregates)
  • The controller uses a formal state machine pattern with typed transitions between states
  • Workers execute Arrow-based operators with parallelism and shuffle partitioning
  • Periodic checkpoints ensure exactly-once processing guarantees
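The watermark mechanics behind these streaming extensions can be sketched in a few lines. The snippet below illustrates the general model only (minimum-of-inputs watermark propagation, windows firing once the watermark passes their end); it is not Arroyo's implementation:

```python
def combined_watermark(input_watermarks):
    """An operator may only advance to the slowest input's watermark."""
    return min(input_watermarks)

def window_is_complete(window_end, watermark):
    """A tumbling window [start, end) can fire once the watermark
    reaches or passes the window's end."""
    return watermark >= window_end

# Two upstream partitions report watermarks 1000 and 1200 (event-time ms).
wm = combined_watermark([1000, 1200])
print(wm)                            # 1000
print(window_is_complete(1000, wm))  # True: window ending at 1000 can fire
print(window_is_complete(2000, wm))  # False: still waiting for late data
```

The minimum rule is what makes watermarks a safe lower bound: no input can deliver an event older than its own watermark, so the combined stream cannot either.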

Usage

Execute this workflow when you have configured connection tables for your data sources and sinks and want to run a continuous streaming SQL query. This is the primary use case for Arroyo: transforming, aggregating, joining, or filtering real-time data streams using SQL.

Execution Steps

Step 1: SQL Query Validation

Submit the SQL query to the validation endpoint. The system builds a schema provider by loading all registered connection tables, connection profiles, global UDFs, and local UDFs. The SQL is then parsed using Arroyo's custom SQL dialect (extending DataFusion's parser) and compiled into a logical plan to verify correctness without creating a persistent pipeline.

Key considerations:

  • The query must reference valid connection tables as sources and sinks
  • Streaming-specific SQL extensions (tumbling windows, sliding windows, session windows, watermarks) are validated
  • UDFs referenced in the query must be pre-registered and compiled
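As a rough sketch of what this step guards against, the toy checker below verifies that every referenced table is a registered connection table. The table names and the regex-based "parsing" are invented for illustration; Arroyo's real validator builds a full logical plan using its DataFusion-based dialect:

```python
import re

# Hypothetical registry of connection tables; in Arroyo these come from
# previously configured sources and sinks (the names here are made up).
CONNECTION_TABLES = {"orders", "order_stats"}

def validate_query(sql: str) -> list[str]:
    """Toy pre-check: every table after FROM / INSERT INTO must be a
    registered connection table."""
    errors = []
    referenced = re.findall(r"(?:FROM|INSERT INTO)\s+(\w+)", sql, re.IGNORECASE)
    for table in referenced:
        if table not in CONNECTION_TABLES:
            errors.append(f"unknown connection table: {table}")
    return errors

query = """
INSERT INTO order_stats
SELECT count(*) FROM orders
GROUP BY tumble(interval '1 minute')
"""
print(validate_query(query))  # [] -- both tables are registered
print(validate_query("SELECT * FROM missing_table"))
```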

Step 2: Pipeline Creation

Create the pipeline by submitting the validated SQL query along with configuration parameters (parallelism, checkpoint interval). The API compiles the SQL into a full logical program, enforces organization-level limits on parallelism and operator count, serializes the program to protobuf, persists the pipeline definition to the database, and creates an associated job record. For sinks that use schema registries (Avro, JSON Schema), schemas are registered with the configured registry.

Key considerations:

  • Parallelism can be configured per-pipeline or defaults to a system-wide setting
  • Preview mode replaces all sinks with preview sinks for testing
  • The compiled program includes the complete dataflow graph with operator types, edge connectivity, and schema information
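The creation step can be pictured as a function that applies limits, serializes the program, and derives a job record. The limit values, field names, and the use of JSON in place of protobuf are all assumptions for illustration:

```python
import hashlib
import json

# Illustrative organization-level caps (actual values are assumptions).
MAX_PARALLELISM = 64
MAX_OPERATORS = 100

def create_pipeline(name, program, parallelism=None, default_parallelism=4):
    """Enforce limits, serialize the program, and mint a pipeline record.
    Parallelism falls back to a system-wide default when unset."""
    parallelism = parallelism or default_parallelism
    if parallelism > MAX_PARALLELISM:
        raise ValueError(f"parallelism {parallelism} exceeds {MAX_PARALLELISM}")
    if len(program["operators"]) > MAX_OPERATORS:
        raise ValueError("too many operators")
    # Arroyo serializes the program as protobuf; JSON stands in here.
    serialized = json.dumps(program, sort_keys=True).encode()
    pipeline_id = hashlib.sha256(serialized).hexdigest()[:12]
    return {"id": pipeline_id, "parallelism": parallelism, "program": serialized}

program = {"operators": ["source", "window_agg", "sink"],
           "edges": [["source", "window_agg"], ["window_agg", "sink"]]}
pipeline = create_pipeline("order_stats", program, parallelism=8)
print(pipeline["id"], pipeline["parallelism"])
```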

Step 3: Job Compilation

The controller picks up the newly created job and enters the Compiling state. The program graph undergoes optimization passes including operator chaining (merging compatible adjacent operators to reduce serialization overhead). If the pipeline uses Rust UDFs, the compiler service compiles the UDF source code into a shared library and uploads it to storage.

Key considerations:

  • Operator chaining merges sequential operators that can run in-process without data serialization
  • UDF compilation uses a content-hash-based cache to avoid redundant builds
  • The compiled program is stored as a serialized protobuf in the database
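Operator chaining can be sketched as a greedy pass over a linear pipeline: adjacent operators with a compatible edge are merged into one node, eliminating the serialization boundary between them. The operator names and chaining predicate below are illustrative, not Arroyo's optimizer:

```python
def chain_operators(ops, can_chain):
    """Greedily merge adjacent operators of a linear pipeline into
    chained groups that run in-process."""
    chained = [[ops[0]]]
    for prev, cur in zip(ops, ops[1:]):
        if can_chain(prev, cur):
            chained[-1].append(cur)
        else:
            chained.append([cur])
    return ["->".join(group) for group in chained]

# Assume a forward edge with matching parallelism allows chaining,
# while a shuffle (e.g., into a keyed aggregate) does not.
PARALLELISM = {"source": 4, "filter": 4, "project": 4, "agg": 2, "sink": 2}

def can_chain(a, b):
    return PARALLELISM[a] == PARALLELISM[b]

print(chain_operators(["source", "filter", "project", "agg", "sink"], can_chain))
# ['source->filter->project', 'agg->sink']
```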

Step 4: Worker Scheduling

The controller transitions to the Scheduling state, in which it calculates the required number of worker slots based on the program's parallelism settings and operator topology. The scheduler (embedded for single-node deployments, or Kubernetes for distributed ones) launches worker processes. The controller then waits for all workers to connect via gRPC and report readiness.

Key considerations:

  • Embedded scheduler runs workers as in-process tasks (for arroyo cluster mode)
  • Kubernetes scheduler creates pods with configurable resource requests
  • Workers must connect within a configurable timeout or the job transitions to a failure state
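Under a slot-sharing model, the slot requirement is driven by the widest operator, and the job fails if workers miss the registration deadline. The sketch below assumes that model; treat both the formula and the tick-based wait as illustrations rather than Arroyo's exact logic:

```python
import math

def required_slots(operator_parallelism, slots_per_worker):
    """With slot sharing (one subtask of each chained operator per slot),
    the slot count is set by the widest operator."""
    slots = max(operator_parallelism.values())
    workers = math.ceil(slots / slots_per_worker)
    return slots, workers

def wait_for_workers(connected_per_tick, needed, timeout_ticks):
    """Simulated controller wait: succeed once enough workers have
    registered, or give up after the timeout."""
    for tick, count in enumerate(connected_per_tick):
        if tick >= timeout_ticks:
            break
        if count >= needed:
            return "ready"
    return "timed_out"

slots, workers = required_slots({"source": 4, "agg": 8, "sink": 2}, slots_per_worker=4)
print(slots, workers)  # 8 2
print(wait_for_workers([0, 1, 2], needed=2, timeout_ticks=5))  # ready
```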

Step 5: Task Assignment and Execution Start

Once all workers are connected, the controller computes task-to-worker assignments, distributing operator subtasks across the available worker slots. If resuming from a checkpoint, it loads the checkpoint metadata needed for state restoration. Each worker receives a StartExecution command containing its assigned tasks, the program graph, and checkpoint restore information. Workers construct the physical execution graph by instantiating operator nodes, wiring data channels (forward for same-parallelism edges, shuffle for repartitioning), and spawning task threads.

Key considerations:

  • Forward edges connect 1:1 between operators with the same parallelism
  • Shuffle edges create all-to-all connections for repartitioning operations
  • Each operator subtask runs in its own tokio task with input/output queues and a control channel
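A minimal sketch of assignment and shuffle routing, assuming round-robin subtask placement and hash partitioning by key (both common choices, not necessarily Arroyo's exact scheme):

```python
from collections import defaultdict

def assign_tasks(operators, workers):
    """Round-robin placement of each operator's subtasks across workers."""
    assignment = defaultdict(list)
    slot = 0
    for op, parallelism in operators.items():
        for subtask in range(parallelism):
            assignment[workers[slot % len(workers)]].append((op, subtask))
            slot += 1
    return dict(assignment)

def shuffle_partition(key, downstream_parallelism):
    """A shuffle edge routes each record to a partition by key hash, so
    all records with the same key reach the same downstream subtask."""
    return hash(key) % downstream_parallelism

print(assign_tasks({"source": 2, "agg": 2}, ["w0", "w1"]))
```

With two workers and two operators at parallelism 2, each worker ends up running one subtask of each operator, which is the intent of slot-based distribution.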

Step 6: Steady State Processing

The controller enters the Running state and monitors the pipeline. Data flows through the operator graph: sources read from external systems, operators process records using Arrow columnar format, and sinks write to output systems. The controller periodically triggers checkpoint barriers at the configured interval. It also monitors worker heartbeats, collects metrics, and handles user requests (stop, restart, rescale).

Key considerations:

  • Checkpoint intervals are configurable and trigger distributed snapshots of all operator state
  • Worker failures during running trigger a transition to the Recovering state
  • Preview pipelines have a TTL after which they automatically stop
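The distributed snapshots described here follow the aligned-barrier pattern: an operator with multiple inputs snapshots its state only after the barrier for a given checkpoint epoch has arrived on every input channel. A toy simulation of that rule, not Arroyo's engine code:

```python
def process_until_snapshot(events, n_inputs):
    """Scan (channel, item) events in arrival order; return the index at
    which the operator may snapshot, i.e. once a barrier has been seen
    on all n_inputs channels. Returns None if alignment never completes."""
    barriers = set()
    for i, (channel, item) in enumerate(events):
        if item == "barrier":
            barriers.add(channel)
            if len(barriers) == n_inputs:
                return i  # safe point: state reflects all pre-barrier records
    return None

events = [(0, "rec"), (0, "barrier"), (1, "rec"), (1, "barrier"), (1, "rec")]
print(process_until_snapshot(events, n_inputs=2))  # 3
```

The record at index 4 arrives after the snapshot point and therefore belongs to the next checkpoint epoch, which is what makes the snapshot a consistent cut of the stream.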

Step 7: Pipeline Shutdown

When the user requests a stop, the controller transitions through CheckpointStopping (taking a final checkpoint for a clean resume) and Stopping (sending stop commands to workers and waiting for graceful shutdown) before reaching the Stopped terminal state. If the pipeline completes naturally (bounded sources), it transitions through Finishing to the Finished state. Failures trigger the Failing state, which records the error.

Key considerations:

  • A final checkpoint is taken before stopping to enable clean resume from the last consistent state
  • Workers are given a grace period to flush pending data to sinks
  • The pipeline can be restarted from the last checkpoint, maintaining exactly-once guarantees
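The terminal portion of the lifecycle can be modeled as a typed transition table in which illegal moves are rejected. The state names come from this article, but the transition set below is a simplified subset of the controller's actual machine:

```python
# Allowed transitions drawn from the lifecycle described above
# (simplified; the real controller defines more states and edges).
TRANSITIONS = {
    "Running": {"CheckpointStopping", "Recovering", "Finishing", "Failing"},
    "CheckpointStopping": {"Stopping", "Failing"},
    "Stopping": {"Stopped", "Failing"},
    "Finishing": {"Finished", "Failing"},
    "Recovering": {"Running", "Failing"},
}

def transition(state, target):
    """Typed-transition sketch: reject any move the machine doesn't allow."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

state = "Running"
for target in ("CheckpointStopping", "Stopping", "Stopped"):
    state = transition(state, target)
print(state)  # Stopped
```

Encoding the table this way means an impossible edge (say, Stopped back to Running without a restart path) fails loudly instead of corrupting the job's lifecycle.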

Execution Diagram

GitHub URL

Workflow Repository