
Workflow:ArroyoSystems Arroyo Local Pipeline Execution

From Leeroopedia


Knowledge Sources
Domains Stream_Processing, SQL, CLI
Last Updated 2026-02-08 09:00 GMT

Overview

End-to-end process for executing a SQL streaming pipeline locally using the arroyo run command, which starts an embedded cluster, compiles the query, runs it as a self-contained process, and outputs results to stdout or configured sinks.

Description

This workflow covers the standalone pipeline execution mode provided by the arroyo run CLI subcommand. Unlike the full cluster mode (which requires a persistent API server and controller), arroyo run spins up an embedded controller, API server, and worker within a single process. It reads a SQL query from a file or stdin, validates and compiles it, creates a pipeline, and monitors execution until the user terminates it or the pipeline completes. This mode supports state persistence across restarts through local SQLite or remote object storage, enabling long-running stateful pipelines without a full distributed deployment.

Key aspects:

  • Single-process execution with embedded controller, API, and worker
  • SQLite database for metadata storage (with optional remote backup to S3/GCS)
  • Default stdout sink when no explicit sink is configured
  • Graceful shutdown with final checkpoint for clean restarts
  • State persistence enables resuming from the last checkpoint across process restarts

Usage

Execute this workflow when you want to quickly run a SQL streaming query without deploying a full Arroyo cluster. This is ideal for development, testing, edge deployments, or single-node production use cases where distributed scaling is not required. Use arroyo run for rapid prototyping of SQL queries against live data sources.
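A minimal invocation sketch (the file name `query.sql` is a placeholder; as the workflow notes, the query can also arrive on stdin):

```shell
# Run a pipeline from a SQL file
arroyo run query.sql

# Or pipe the query in on stdin
cat query.sql | arroyo run
```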

Execution Steps

Step 1: Prepare SQL Query

Write the SQL pipeline query in a file or prepare it for stdin input. The query can include CREATE TABLE statements for inline connection table definitions, or reference pre-existing connection tables. If no explicit sink is specified, the system defaults to stdout output. Connection tables defined inline via CREATE TABLE WITH clauses do not require separate connection setup.

Key considerations:

  • The SQL file can contain multiple statements (CREATE TABLE for sources/sinks, then SELECT/INSERT for the pipeline)
  • The --sink-stdout flag or default behavior routes output to the terminal
  • Complex pipelines with windowing, joins, and aggregations are fully supported
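The considerations above can be illustrated with a self-contained query sketch. The `impulse` connector and its `WITH`-clause options, and the `tumble` windowing call, follow Arroyo's SQL dialect as commonly documented, but treat the exact keys as assumptions and check the Arroyo connector reference:

```sql
-- Inline source definition: no separate connection setup is needed.
-- (connector options are illustrative)
CREATE TABLE events (
    counter BIGINT,
    subtask_index BIGINT
) WITH (
    connector = 'impulse',
    event_rate = '10'
);

-- No sink is declared, so results default to stdout.
SELECT count(*) AS events_per_window
FROM events
GROUP BY tumble(interval '5 seconds');
```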

Step 2: Initialize Local Cluster

The arroyo run command initializes a self-contained local cluster. It configures the database backend to SQLite (creating or opening a local database file), sets the scheduler to the embedded process scheduler, starts an embedded API server and controller, and waits for all services to be ready. If a remote storage backend is configured, the system downloads any existing SQLite database from object storage to enable state resume.

Key considerations:

  • SQLite in WAL mode allows concurrent reads and writes from within the single process
  • The embedded scheduler runs worker tasks as in-process tokio tasks rather than separate processes
  • If resuming a previous run, the existing pipeline state is loaded from the local database
  • Database migrations are run automatically on startup
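Because the metadata store is plain SQLite, standard tooling can inspect it. A sketch, assuming a hypothetical database file name (`arroyo run` manages its own file):

```shell
# Enable/report WAL journal mode on a SQLite database
# ("state.sqlite" is a hypothetical path for illustration)
sqlite3 state.sqlite 'PRAGMA journal_mode=WAL;'   # prints the active mode

# List the tables created by the startup migrations
sqlite3 state.sqlite '.tables'
```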

Step 3: Submit and Compile Pipeline

The system submits the SQL query to the local API server for validation and compilation. The query is parsed, connection tables are resolved, UDFs are compiled if present, and the logical dataflow program is generated. If a pipeline with the same query already exists in the database (from a previous run), the existing pipeline is reused rather than creating a new one.

Key considerations:

  • Pipeline reuse enables seamless checkpoint-based resume across process restarts
  • The compilation process is identical to the full cluster mode
  • Inline connection tables are created as part of the pipeline compilation
  • Compilation errors are reported to stderr with descriptive error messages
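When an explicit sink is wanted instead of the stdout default, it can be declared inline and targeted with `INSERT INTO`. A sketch assuming a source table named `events` exists; the Kafka connector option keys shown are illustrative and should be checked against the Arroyo connector documentation:

```sql
-- Explicit sink declared inline as part of the same query
CREATE TABLE output_sink (
    total BIGINT
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    topic = 'results',
    type = 'sink',
    format = 'json'
);

INSERT INTO output_sink
SELECT count(*) FROM events
GROUP BY tumble(interval '5 seconds');
```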

Step 4: Execute Pipeline

The controller picks up the job and drives it through the state machine: Compiling, Scheduling (with embedded workers), Running. The system monitors the pipeline state by polling the local API. Data flows through the operator graph within the single process: source operators read from external systems, processing operators transform data, and sink operators write results (defaulting to stdout).

Key considerations:

  • All operators run as tokio tasks within the same process
  • Periodic checkpoints continue to run, storing state to the configured storage backend
  • Stdout output displays processed records in real-time
  • The process handles SIGINT/SIGTERM for graceful shutdown
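Since records arrive on stdout, they compose with ordinary shell pipelines. A sketch, assuming the default stdout sink emits one JSON-encoded record per line:

```shell
# Pretty-print streaming results as they arrive
arroyo run query.sql | jq .
```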

Step 5: Graceful Shutdown

When the user sends a termination signal (Ctrl+C), the PipelineShutdownHandler initiates a graceful shutdown sequence. It requests a final checkpoint to capture the latest state, waits for the checkpoint to complete, then stops the pipeline. If a remote storage backend is configured, the SQLite database is backed up to object storage, enabling future restarts to resume from this point.

Key considerations:

  • The final checkpoint ensures no data is lost between the last periodic checkpoint and shutdown
  • Database backups are scheduled periodically during execution (not just at shutdown)
  • A second termination signal forces immediate shutdown without waiting for the checkpoint
  • The pipeline can be resumed by running the same SQL query again with arroyo run
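The shutdown-and-resume cycle described above looks like this in practice (file name is a placeholder):

```shell
# First run: starts fresh and checkpoints periodically
arroyo run pipeline.sql
# Ctrl+C -> final checkpoint is taken, then the process exits

# Second run of the same query: the existing pipeline is found in the
# SQLite database and resumes from the last checkpoint
arroyo run pipeline.sql
```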
