
Principle:ArroyoSystems Arroyo Pipeline Creation

From Leeroopedia


Metadata

  • Page Type: Principle
  • Knowledge Sources: Repo (ArroyoSystems/arroyo), Doc (Arroyo Documentation)
  • Domains: Stream_Processing, Pipeline_Management
  • Last Updated: 2026-02-08

Overview

This principle covers creating and persisting streaming pipeline configurations in the Arroyo engine: taking a compiled SQL program, encoding it into a storable format, and persisting it alongside execution metadata (name, parallelism, checkpoint interval) to create a runnable pipeline entity that can be scheduled, started, stopped, and updated throughout its lifecycle.

Description

Pipeline creation is the bridge between SQL compilation and job execution. Once a SQL query has been compiled into a LogicalProgram (a dataflow graph of operators), the pipeline creation process materializes this program into a persistent, versionable artifact in the system's database. This artifact, combined with execution configuration, forms a pipeline -- the fundamental unit of deployment in Arroyo.

Compilation and Encoding

The pipeline creation process begins by invoking the SQL compiler to produce a CompiledSql result from the user's query and UDF definitions. The resulting LogicalProgram is then serialized into a binary format (Protocol Buffers) for compact storage. This encoding preserves the complete structure of the dataflow graph -- operator definitions, edge connections, schemas, and partitioning strategies -- in a format that can be efficiently deserialized at job startup time.
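The encode-on-create / decode-on-startup round trip can be sketched as follows. This is an illustrative stand-in only: the real engine serializes its `LogicalProgram` with Protocol Buffers against Arroyo's own schema, whereas this sketch uses a simplified dataclass and JSON purely to show the shape of the round trip.

```python
import json
from dataclasses import dataclass, asdict

# Simplified stand-in for Arroyo's LogicalProgram dataflow graph.
# Field names here are illustrative, not the engine's actual schema.
@dataclass
class LogicalProgram:
    operators: list  # operator definitions (id, kind, config, ...)
    edges: list      # [from_id, to_id, partitioning] triples

def encode_program(program: LogicalProgram) -> bytes:
    """Serialize the dataflow graph for compact storage in the database."""
    return json.dumps(asdict(program), sort_keys=True).encode("utf-8")

def decode_program(blob: bytes) -> LogicalProgram:
    """Deserialize the stored graph at job startup time."""
    raw = json.loads(blob.decode("utf-8"))
    return LogicalProgram(operators=raw["operators"], edges=raw["edges"])

program = LogicalProgram(
    operators=[{"id": "source_0", "kind": "kafka_source"},
               {"id": "sink_0", "kind": "kafka_sink"}],
    edges=[["source_0", "sink_0", "forward"]],
)
blob = encode_program(program)
assert decode_program(blob) == program  # round trip preserves the graph
```

The essential property, regardless of wire format, is that the decoded graph is structurally identical to the compiled one, so the job that runs is exactly the program that was compiled.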

Metadata Association

Each pipeline is associated with configuration metadata that governs its execution:

  • Name: A human-readable identifier for the pipeline, unique within a user's namespace.
  • Parallelism: The default number of parallel instances for each operator in the dataflow graph. This can be overridden per-operator but provides a baseline for resource allocation.
  • Checkpoint interval: The duration between consistent snapshots of operator state. Shorter intervals provide faster recovery at the cost of higher checkpoint overhead; longer intervals reduce overhead but increase recovery time.
  • Preview mode: A flag indicating whether the pipeline should run in preview mode, where output is streamed to the web console rather than written to configured sinks.
  • Sink enablement: A flag controlling whether sink operators are active. Disabling sinks allows testing query logic without side effects on external systems.
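The metadata fields above can be modeled as a small configuration record. This is a hypothetical sketch: the field names, defaults, and the rule coupling preview mode to sink disablement are illustrative assumptions, not Arroyo's actual schema.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical pipeline configuration mirroring the metadata list above.
@dataclass
class PipelineConfig:
    name: str
    parallelism: int = 1
    checkpoint_interval: timedelta = timedelta(seconds=10)
    preview: bool = False
    enable_sinks: bool = True

    def __post_init__(self):
        if not self.name:
            raise ValueError("pipeline name must be non-empty")
        if self.parallelism < 1:
            raise ValueError("parallelism must be at least 1")
        if self.preview:
            # Preview streams results to the web console, so external
            # sinks are kept inactive to avoid side effects.
            self.enable_sinks = False

cfg = PipelineConfig(name="orders_rollup", parallelism=4, preview=True)
```

Validating configuration at construction time keeps invalid pipelines out of the database entirely, rather than failing later at scheduling time.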

Persistence and Versioning

Pipelines are persisted as database records with the following lifecycle considerations:

  • Pipeline record: A row in the pipelines table containing the pipeline ID, name, owner, creation timestamp, and current configuration.
  • Pipeline version: Each update to a pipeline's SQL query or configuration creates a new version, preserving the history of changes and enabling rollback.
  • Job association: When a pipeline is created, a corresponding job record is also created that references it. The job is the runtime entity that tracks execution state (running, stopped, failed) and checkpoint progress.
  • Connection tracking: The IDs of all connections (sources and sinks) referenced by the pipeline are recorded, enabling dependency tracking and impact analysis when connections are modified.
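A minimal in-memory sketch of the persistence and versioning behavior described above, assuming a Postgres-like table is replaced by plain dicts for illustration; column names and the version-history shape are assumptions, not Arroyo's actual schema.

```python
import itertools
from datetime import datetime, timezone

# In-memory stand-in for the pipelines table plus its version history.
class PipelineStore:
    def __init__(self):
        self._ids = itertools.count(1)
        self.pipelines = {}  # id -> latest record
        self.versions = {}   # id -> list of historical records

    def create(self, name, program_blob, connection_ids):
        pid = next(self._ids)
        record = {
            "id": pid, "name": name, "version": 1,
            "program": program_blob,
            "connections": list(connection_ids),  # dependency tracking
            "created_at": datetime.now(timezone.utc),
        }
        self.pipelines[pid] = record
        self.versions[pid] = [record]
        return pid

    def update(self, pid, program_blob):
        """Each update creates a new version, preserving history for rollback."""
        prev = self.pipelines[pid]
        record = {**prev, "version": prev["version"] + 1, "program": program_blob}
        self.pipelines[pid] = record
        self.versions[pid].append(record)
        return record["version"]

store = PipelineStore()
pid = store.create("orders_rollup", b"<proto>", ["kafka_src_1", "kafka_sink_1"])
store.update(pid, b"<proto-v2>")
```

Recording connection IDs alongside each version is what makes impact analysis cheap: when a connection is modified, the affected pipelines can be found with a single lookup.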

Job Record Creation

As part of pipeline creation, a corresponding job record is initialized in the Created state. This job record enters the state machine lifecycle (see Principle:ArroyoSystems_Arroyo_Job_State_Machine) and progresses through compilation, scheduling, and execution stages. The job record includes:

  • A reference to the pipeline and its version
  • The serialized program graph
  • Execution configuration (parallelism, checkpoint interval)
  • Current state and state transition history
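The job record's contents can be sketched as a dataclass. State names follow the lifecycle referenced above (Principle:ArroyoSystems_Arroyo_Job_State_Machine), but the exact fields and the transition bookkeeping here are illustrative assumptions, not Arroyo's database schema.

```python
from dataclasses import dataclass, field

# Hypothetical job record initialized in the Created state.
@dataclass
class JobRecord:
    pipeline_id: int
    pipeline_version: int
    program: bytes              # serialized program graph
    parallelism: int
    checkpoint_interval_secs: int
    state: str = "Created"
    history: list = field(default_factory=list)

    def transition(self, new_state: str):
        """Append every state change so the full history is queryable."""
        self.history.append((self.state, new_state))
        self.state = new_state

job = JobRecord(pipeline_id=1, pipeline_version=1, program=b"<proto>",
                parallelism=4, checkpoint_interval_secs=10)
job.transition("Compiling")
```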

Usage

Pipeline creation is used in the following scenarios:

  • New pipeline deployment: A user submits a SQL query through the web console or REST API, and a new pipeline is created with the specified configuration.
  • Pipeline updates: When a user modifies an existing pipeline's query or configuration, a new pipeline version is created and the associated job is restarted with the updated program.
  • Preview execution: A pipeline is created in preview mode with sinks disabled, allowing users to observe query results in real time without writing to external systems.
  • Programmatic pipeline management: CI/CD systems or orchestration tools create pipelines via the REST API as part of automated deployment workflows.
  • Pipeline cloning: An existing pipeline's configuration can be used as a template for creating new pipelines with modified parameters.
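For the programmatic scenarios above, a creation request might be assembled as follows. The endpoint path and field names are illustrative assumptions; consult the Arroyo API reference for the exact request schema.

```python
import json

# Hypothetical REST request body for creating a pipeline.
def create_pipeline_request(name, query, parallelism=1, preview=False):
    return {
        "method": "POST",
        "path": "/api/v1/pipelines",  # assumed path, not verified
        "body": json.dumps({
            "name": name,
            "query": query,
            "parallelism": parallelism,
            "preview": preview,
        }),
    }

req = create_pipeline_request(
    "orders_rollup",
    "SELECT count(*) FROM orders GROUP BY hop(interval '5 seconds', interval '1 minute')",
    parallelism=4,
)
```

A CI/CD workflow would send this request after its test stage, making pipeline deployment a versioned, reviewable artifact like any other release step.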

Theoretical Basis

Artifact Materialization

Pipeline creation implements the pattern of materializing a compiled query plan into a persistent artifact. This is analogous to the compilation-to-binary step in traditional software: the compiled program (analogous to object code) is serialized and stored, decoupling the compilation phase from the execution phase. This separation enables:

  • Deferred execution: A pipeline can be created now and started later.
  • Reproducibility: The exact program that was compiled can be retrieved and re-executed.
  • Versioning: Multiple versions of a pipeline can coexist, enabling rollback and A/B testing.

Execution Configuration

The execution metadata associated with a pipeline represents the deployment configuration -- parameters that affect how the program runs without changing its logic. Key configuration dimensions include:

  • Parallelism: Determines resource allocation and throughput capacity. Higher parallelism increases throughput but requires more cluster resources. The optimal setting depends on input data rate and operator complexity.
  • Checkpoint interval: Balances recovery time against steady-state overhead. In Chandy-Lamport style distributed snapshots, shorter intervals mean more frequent barrier propagation and state serialization, but faster recovery after failures.
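The checkpoint-interval trade-off can be made concrete with a back-of-envelope model. The numbers and the simplifying assumptions (failures land midway through an interval on average; a fixed per-checkpoint cost) are illustrative, not Arroyo measurements.

```python
# Toy model of the recovery-time vs. overhead trade-off.
def expected_reprocessing_secs(interval_secs: float) -> float:
    # On failure, work since the last completed checkpoint is replayed;
    # on average a failure lands midway through an interval.
    return interval_secs / 2

def checkpoint_overhead_fraction(interval_secs: float, cost_secs: float) -> float:
    # Fraction of wall-clock time spent taking snapshots.
    return cost_secs / interval_secs

for interval in (5, 30, 120):
    replay = expected_reprocessing_secs(interval)
    overhead = checkpoint_overhead_fraction(interval, cost_secs=1.0)
    print(f"interval={interval}s  expected replay={replay}s  overhead={overhead:.1%}")
```

Under these assumptions a 5-second interval spends 20% of its time checkpointing but replays at most a few seconds after failure, while a 120-second interval has under 1% overhead but may replay up to two minutes of input.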

Lifecycle Management

Pipeline creation establishes the initial state of a managed lifecycle. The pipeline entity progresses through well-defined states (created, running, stopped, failed) with associated transitions. This lifecycle management pattern ensures that:

  • Resource allocation and deallocation are tracked.
  • State is consistently maintained across failures and restarts.
  • Users have visibility into pipeline status and history.
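A lifecycle guard over these states can be sketched as a transition table. The state set comes from the prose above; the exact transition graph in Arroyo's scheduler is richer than this illustration.

```python
# Only transitions listed here are legal; everything else is rejected.
ALLOWED = {
    "created": {"running"},
    "running": {"stopped", "failed"},
    "stopped": {"running"},
    "failed": {"running"},  # restart after failure
}

def transition(state: str, target: str) -> str:
    if target not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

state = transition("created", "running")
state = transition(state, "failed")
state = transition(state, "running")  # restart after failure
```

Centralizing the legality check in one place is what makes the guarantees above enforceable: every component that moves a pipeline between states goes through the same table.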
