Principle:ArroyoSystems Arroyo Local Cluster Initialization

From Leeroopedia

Summary

Local Cluster Initialization is the principle of bringing up a complete streaming cluster within a single operating system process for local development and testing. In Arroyo, this is realized through the arroyo run command, which combines the controller, API server, embedded scheduler, and SQLite-backed metadata storage into one process, so developers can execute full streaming pipelines without deploying any external infrastructure.

Theoretical Basis

Embedded cluster mode collapses a distributed architecture into a single process for development ergonomics. A production Arroyo deployment consists of multiple independent services -- a controller for orchestration, an API server for external interaction, a scheduler for worker management, and a PostgreSQL database for metadata persistence. Starting and managing all of these services just to try out a pipeline locally is disproportionately costly. The local cluster initialization principle addresses this by composing all of the services in-process.

Key Concepts

In-Process Service Composition

Rather than running the controller, API, and scheduler as separate networked services, local cluster initialization embeds them as concurrent tasks within a single Tokio runtime. The controller starts as a gRPC server on a dynamically assigned port, the API starts as an HTTP server on another dynamic port, and the scheduler runs embedded within the controller. Communication between these components occurs through local network connections to localhost, preserving the same API contracts used in production while eliminating the operational overhead of multiple processes.
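The dynamic-port part of this composition can be illustrated with a minimal, standard-library-only sketch. It is not Arroyo's code: OS threads stand in for Tokio tasks, plain TCP stands in for gRPC and HTTP, and the names start_service and call are hypothetical. It only shows that binding to port 0 lets the operating system pick a free port, and that the services still talk to each other over localhost.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Bind to port 0 so the OS assigns a free port, then answer a single
/// connection on a background thread. Returns the address actually assigned.
/// (Hypothetical helper; a thread stands in for a Tokio task.)
fn start_service(name: &'static str) -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        if let Ok((mut stream, _)) = listener.accept() {
            // Reply with the service name; the stream closes when dropped.
            let _ = stream.write_all(name.as_bytes());
        }
    });
    Ok(addr)
}

/// Connect to a service over localhost and read its whole reply.
fn call(addr: SocketAddr) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    let mut reply = String::new();
    stream.read_to_string(&mut reply)?;
    Ok(reply)
}

fn main() -> std::io::Result<()> {
    // Both "services" live in this one process, each on its own dynamic port.
    let controller = start_service("controller")?;
    let api = start_service("api")?;
    println!("controller listening on {controller}, api listening on {api}");
    println!("reply from controller port: {}", call(controller)?);
    println!("reply from api port: {}", call(api)?);
    Ok(())
}
```

Because the port is only known after binding, anything that needs to reach a service (such as the client created later in the sequence) has to be constructed from the returned address rather than from static configuration.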

Local State Management

Production Arroyo deployments rely on PostgreSQL for metadata storage, but the local cluster mode substitutes SQLite. This provides a zero-configuration, file-based metadata store that requires no external database server. The SQLite database is stored in a configurable state directory alongside checkpoint data, making the entire pipeline state self-contained and portable. When the state directory points to a remote location (such as S3), the SQLite file is fetched on startup and backed up periodically during execution.
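The local-versus-remote distinction described above can be sketched as a small decision function. Everything here is an assumption made for illustration: the DbLocation type, the resolve_db function, the file name state.sqlite, and the rule that only an s3:// prefix counts as remote are hypothetical and are not taken from Arroyo's source.

```rust
use std::path::PathBuf;

/// Hypothetical file name for the metadata database inside the state directory.
const DB_FILE: &str = "state.sqlite";

/// Where the SQLite file lives and whether it needs fetching and backing up.
#[derive(Debug, PartialEq)]
enum DbLocation {
    /// The state directory is on local disk; the file is used in place.
    Local(PathBuf),
    /// The state directory is remote; the file is fetched to a local copy
    /// on startup and uploaded again by a periodic backup.
    Remote { remote_url: String, local_copy: PathBuf },
}

/// Decide how to treat the database for a given state directory.
/// Assumption: only an `s3://` prefix is treated as remote in this sketch.
fn resolve_db(state_dir: &str, scratch_dir: &str) -> DbLocation {
    if state_dir.starts_with("s3://") {
        DbLocation::Remote {
            remote_url: format!("{}/{}", state_dir.trim_end_matches('/'), DB_FILE),
            local_copy: PathBuf::from(scratch_dir).join(DB_FILE),
        }
    } else {
        DbLocation::Local(PathBuf::from(state_dir).join(DB_FILE))
    }
}

fn main() {
    println!("{:?}", resolve_db("/tmp/arroyo-state", "/tmp/scratch"));
    println!("{:?}", resolve_db("s3://my-bucket/pipelines/demo/", "/tmp/scratch"));
}
```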

Embedded Scheduling

In production, Arroyo supports multiple scheduler backends (Kubernetes, process-based, etc.) for launching worker tasks. In local cluster mode, the scheduler is set to either Embedded (workers run as Tokio tasks within the same process) or Process (workers run as child processes). This eliminates the need for container orchestration or any external process manager, while still exercising the same scheduling interfaces.
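The idea of exercising one scheduling interface with two local backends can be sketched with a trait. This is an illustration under stated assumptions, not Arroyo's scheduler: the trait and struct names are hypothetical, an OS thread stands in for a Tokio task, the worker-bin program and --worker-id flag are made up, and the process variant only builds the command line rather than spawning it.

```rust
use std::process::Command;
use std::thread;

/// A single interface that every backend implements (hypothetical trait).
trait Scheduler {
    fn start_worker(&self, id: u32) -> String;
}

/// Runs each worker inside the current process.
struct EmbeddedScheduler;

impl Scheduler for EmbeddedScheduler {
    fn start_worker(&self, id: u32) -> String {
        // A thread stands in for a Tokio task in this sketch.
        let handle = thread::spawn(move || format!("worker-{id}: in-process task"));
        handle.join().expect("worker thread panicked")
    }
}

/// Would run each worker as a child process of the current process.
struct ProcessScheduler {
    /// Hypothetical worker executable.
    program: String,
}

impl Scheduler for ProcessScheduler {
    fn start_worker(&self, id: u32) -> String {
        // Build the command only; a real backend would call `spawn()` here.
        let mut cmd = Command::new(&self.program);
        cmd.arg("--worker-id").arg(id.to_string());
        let args: Vec<String> = cmd
            .get_args()
            .map(|a| a.to_string_lossy().into_owned())
            .collect();
        format!(
            "worker-{id}: child process `{} {}`",
            cmd.get_program().to_string_lossy(),
            args.join(" ")
        )
    }
}

/// Caller code depends only on the trait, not on the backend.
fn start_all(scheduler: &dyn Scheduler, ids: &[u32]) -> Vec<String> {
    ids.iter().map(|&id| scheduler.start_worker(id)).collect()
}

fn main() {
    for line in start_all(&EmbeddedScheduler, &[1, 2]) {
        println!("{line}");
    }
    let process = ProcessScheduler { program: "worker-bin".to_string() };
    for line in start_all(&process, &[3]) {
        println!("{line}");
    }
}
```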

State Directory Management

The local cluster manages a state directory that serves as the single source of truth for both checkpoint data and SQLite metadata. If no state directory is specified, a new one is created under the configured checkpoint URL with a timestamped name. If a state directory is provided and points to remote storage, the initialization sequence downloads the existing SQLite database before starting the cluster. This enables seamless resume: restarting with the same state directory picks up from the last checkpoint.
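A minimal sketch of this resolution logic follows. It assumes that an explicit argument takes precedence over a configured value; the function name and the run-<seconds> naming scheme are hypothetical, since the text only says the generated name is timestamped.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Pick the state directory: an explicit argument wins, then a configured
/// value, otherwise a new timestamped directory under the checkpoint URL.
/// The `run-<seconds>` naming is an assumption made for this sketch.
fn resolve_state_dir(
    cli_arg: Option<&str>,
    configured: Option<&str>,
    checkpoint_url: &str,
    now_secs: u64,
) -> String {
    match cli_arg.or(configured) {
        Some(dir) => dir.to_string(),
        None => format!("{}/run-{}", checkpoint_url.trim_end_matches('/'), now_secs),
    }
}

fn main() {
    let now_secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch")
        .as_secs();

    // No directory given: a fresh one is generated.
    println!("{}", resolve_state_dir(None, None, "/tmp/arroyo/checkpoints", now_secs));
    // Same directory given again: the run resumes from the existing state.
    println!("{}", resolve_state_dir(Some("/tmp/arroyo/my-run"), None, "/tmp/arroyo/checkpoints", now_secs));
}
```

Passing the timestamp in as a parameter keeps the function deterministic, which makes the resume case (same input, same directory) easy to check.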

Configuration Adjustments

During local cluster initialization, the system makes several configuration overrides to adapt the distributed architecture for single-process execution:

Configuration         | Production Value    | Local Override       | Purpose
database.type         | PostgreSQL          | SQLite               | Eliminate external database dependency
database.sqlite.path  | N/A                 | State directory path | Point to local SQLite file
api.http_port         | Fixed port          | 0 (dynamic)          | Avoid port conflicts
controller.rpc_port   | Fixed port          | 0 (dynamic)          | Avoid port conflicts
controller.scheduler  | Kubernetes/Process  | Embedded or Process  | Remove orchestrator dependency
pipeline.default_sink | Configurable        | Stdout               | Direct output to terminal for development
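The overrides in the table can be pictured as one function that takes a production-style configuration and returns its local variant. This is a sketch, not Arroyo's configuration code: the Config struct, the enum names, the config.sqlite file name, and the example port numbers are all hypothetical.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, PartialEq)]
enum DatabaseType { Postgres, Sqlite }

#[derive(Debug, Clone, PartialEq)]
enum Scheduler { Kubernetes, Process, Embedded }

#[derive(Debug, Clone, PartialEq)]
enum Sink { Configured(String), Stdout }

/// Hypothetical, flattened view of the settings listed in the table.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    database_type: DatabaseType,
    sqlite_path: Option<PathBuf>,
    api_http_port: u16,
    controller_rpc_port: u16,
    scheduler: Scheduler,
    default_sink: Sink,
}

/// Apply the local-mode overrides. Port 0 asks the OS for a free port.
fn apply_local_overrides(mut config: Config, state_dir: &Path, worker_processes: bool) -> Config {
    config.database_type = DatabaseType::Sqlite;
    // File name is an assumption made for this sketch.
    config.sqlite_path = Some(state_dir.join("config.sqlite"));
    config.api_http_port = 0;
    config.controller_rpc_port = 0;
    config.scheduler = if worker_processes { Scheduler::Process } else { Scheduler::Embedded };
    config.default_sink = Sink::Stdout;
    config
}

/// A production-style starting point; the port numbers are illustrative only.
fn production_like() -> Config {
    Config {
        database_type: DatabaseType::Postgres,
        sqlite_path: None,
        api_http_port: 8000,
        controller_rpc_port: 9000,
        scheduler: Scheduler::Kubernetes,
        default_sink: Sink::Configured("kafka".to_string()),
    }
}

fn main() {
    let local = apply_local_overrides(production_like(), Path::new("/tmp/arroyo-state"), false);
    println!("{local:#?}");
}
```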

Initialization Sequence

The local cluster initialization follows a strict ordering of steps:

  1. State Directory Resolution -- Determine the state directory from arguments or configuration, or generate a new timestamped path.
  2. Database Preparation -- MaybeLocalDb::from_dir() fetches the SQLite database from remote storage if needed, or uses the local path directly.
  3. Configuration Override -- Apply local-mode configuration changes (SQLite, dynamic ports, embedded scheduler, stdout sink).
  4. Database Connection -- init_connection() opens a read-only SQLite connection for backup operations.
  5. Controller Start -- ControllerServer::new(db).start(guard) starts the gRPC controller with its embedded scheduler.
  6. API Server Start -- arroyo_api::start_server(db, guard) starts the HTTP API on a dynamically assigned port.
  7. Client Creation -- Create an arroyo_openapi::Client pointed at the local API server.
  8. Backup Scheduling -- Start a periodic database backup task (every 60 seconds) for remote state directories.
  9. Pipeline Submission -- Submit the SQL query through the API and wait for it to reach the Running state.
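The final step, waiting for the submitted pipeline to reach the Running state, amounts to a polling loop with a timeout. The sketch below shows one way to structure such a loop. It does not use the real arroyo_openapi client: the JobState variants other than Running, the wait_for_running function, and the closure that reports the state are hypothetical stand-ins.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical pipeline states; only `Running` is named in the text above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Created,
    Scheduling,
    Running,
    Failed,
}

/// Poll `poll` until it reports `Running`, fails, or the timeout elapses.
/// In a real client, `poll` would query the local API server.
fn wait_for_running<F>(mut poll: F, interval: Duration, timeout: Duration) -> Result<(), String>
where
    F: FnMut() -> JobState,
{
    let deadline = Instant::now() + timeout;
    loop {
        match poll() {
            JobState::Running => return Ok(()),
            JobState::Failed => {
                return Err("pipeline failed before reaching Running".to_string())
            }
            JobState::Created | JobState::Scheduling => {}
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for the pipeline to reach Running".to_string());
        }
        thread::sleep(interval);
    }
}

fn main() {
    // Simulated sequence of states reported by successive polls.
    let mut reported = vec![JobState::Created, JobState::Scheduling, JobState::Running].into_iter();
    let result = wait_for_running(
        || reported.next().unwrap_or(JobState::Failed),
        Duration::from_millis(10),
        Duration::from_secs(5),
    );
    println!("{result:?}");
}
```

Treating a failed state as an immediate error, rather than waiting for the timeout, keeps the feedback loop short during development.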

Relationship to Distributed Mode

The local cluster initialization principle preserves the same code paths and API contracts used in a full distributed deployment. The only differences are in configuration and process topology. This means:

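  • Bugs found during local development are generally representative of production behavior, apart from issues specific to the overridden components (database backend, scheduler, ports).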
  • State files produced locally can be used to bootstrap a production deployment.
  • The same SQL queries and pipeline definitions work identically in both modes.
