
Principle:ArroyoSystems Arroyo SQL Compilation

From Leeroopedia


Metadata

Page Type: Principle
Knowledge Sources: Repo (ArroyoSystems/arroyo), Doc (Arroyo Documentation), Paper (The Dataflow Model)
Domains: Stream_Processing, SQL, Query_Planning
Last Updated: 2026-02-08

Overview

Arroyo compiles SQL queries into executable streaming dataflow programs. This principle covers the full compilation pipeline: SQL parsing, logical plan generation, streaming plan optimization, and physical dataflow graph creation. The result is a directed acyclic graph (DAG) of operators that can be distributed across a cluster for parallel stream processing.

Description

SQL compilation in a streaming engine transforms a declarative SQL query into an imperative, distributed execution plan. Unlike batch SQL compilation, streaming SQL compilation must account for unbounded inputs, continuous execution, watermark propagation, and windowed state management. Arroyo's compilation pipeline proceeds through several well-defined stages.

Stage 1: SQL Parsing

The raw SQL text is parsed into an Abstract Syntax Tree (AST). Arroyo leverages Apache DataFusion's SQL parser, which supports a broad SQL dialect including streaming extensions. The parser handles:

  • Standard DML statements (SELECT, INSERT INTO)
  • Table expressions (FROM, JOIN, subqueries)
  • Window specifications (TUMBLE, HOP, SESSION)
  • Common table expressions (WITH clauses)
  • User-defined function invocations
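To make the parser's output concrete, the sketch below hand-builds an AST for a small windowed query. The `SelectStmt` and `WindowSpec` shapes are illustrative only; DataFusion's real AST types are richer and named differently.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative AST node shapes (hypothetical; not DataFusion's actual types).
@dataclass
class WindowSpec:
    kind: str         # "TUMBLE", "HOP", or "SESSION"
    time_column: str
    width: str        # interval literal, e.g. "1 minute"

@dataclass
class SelectStmt:
    columns: list
    table: str
    where: Optional[str] = None
    window: Optional[WindowSpec] = None

# Hand-built AST for a query of roughly this shape:
#   SELECT user_id, count(*) FROM events
#   WHERE kind = 'click'
#   GROUP BY user_id, tumble(interval '1 minute')
ast = SelectStmt(
    columns=["user_id", "count(*)"],
    table="events",
    where="kind = 'click'",
    window=WindowSpec(kind="TUMBLE", time_column="_timestamp", width="1 minute"),
)
print(ast.window.kind)  # prints TUMBLE
```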

Stage 2: Logical Plan Generation

The AST is converted into a logical plan using relational algebra operators. Each SQL construct maps to one or more logical operators:

  • SELECT columns map to Projection operators
  • WHERE clauses map to Selection (Filter) operators
  • JOIN clauses map to Join operators with appropriate join type (inner, left, right, full)
  • GROUP BY clauses map to Aggregation operators
  • Window functions map to WindowAggregation operators with associated window specifications
  • Source table references map to Scan operators bound to registered connections

The logical plan is a tree of these operators, representing the query's semantics without specifying execution details.
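The operator tree can be sketched as a simple recursive structure. For the windowed-count query above, the tree reads top-down as Projection over Aggregation over Filter over Scan; the `Plan` class here is a hypothetical stand-in for the engine's logical plan nodes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    op: str
    detail: str
    child: Optional["Plan"] = None

# Logical plan for:
#   SELECT user_id, count(*) FROM events WHERE kind = 'click' GROUP BY user_id
plan = Plan("Projection", "user_id, count(*)",
        Plan("Aggregation", "group by user_id",
            Plan("Filter", "kind = 'click'",
                Plan("Scan", "events"))))

def ops(p):
    """Top-down list of operator names along the plan's spine."""
    return [p.op] + (ops(p.child) if p.child else [])

print(ops(plan))  # ['Projection', 'Aggregation', 'Filter', 'Scan']
```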

Stage 3: Streaming Plan Optimization

The logical plan undergoes optimization passes specific to streaming execution:

  • Predicate pushdown: Filter conditions are pushed as close to sources as possible, reducing the volume of data flowing through the pipeline.
  • Projection pruning: Unused columns are eliminated early, reducing serialization and memory costs.
  • Operator fusion: Adjacent compatible operators (e.g., consecutive filters or projections) are merged into single operators to reduce scheduling overhead.
  • Window optimization: Window specifications are analyzed to determine optimal state management strategies and watermark propagation rules.
  • Join reordering: For multi-way joins, the optimizer may reorder join operations to minimize intermediate state size.
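Predicate pushdown, the first pass above, can be sketched as a tree rewrite: when a Filter sits above a Projection, the two are swapped so rows are dropped before they are projected. This toy version assumes the projection retains every column the filter reads; a real optimizer checks that precondition.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    op: str
    detail: str
    child: Optional["Plan"] = None

def push_down_filters(p):
    """Push Filter nodes below Projection nodes (simplified: assumes the
    projection keeps all columns the filter predicate references)."""
    if p is None:
        return None
    if p.op == "Filter" and p.child and p.child.op == "Projection":
        proj = p.child
        p.child = proj.child          # filter now reads the projection's input
        proj.child = push_down_filters(p)
        return proj                   # projection moves above the filter
    p.child = push_down_filters(p.child)
    return p

def ops(p):
    return [p.op] + (ops(p.child) if p.child else [])

before = Plan("Filter", "kind = 'click'",
           Plan("Projection", "user_id, kind",
             Plan("Scan", "events")))
after = push_down_filters(before)
print(ops(after))  # ['Projection', 'Filter', 'Scan']
```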

Stage 4: Physical Dataflow Graph Creation

The optimized logical plan is translated into a physical dataflow graph -- a DAG of concrete operator implementations connected by typed data channels. Each logical operator is mapped to one or more physical operators:

  • Logical scans become source operators bound to specific connector implementations (Kafka, Kinesis, filesystem)
  • Logical aggregations become stateful operators with configurable state backends
  • Logical joins become temporal join operators with appropriate windowing semantics
  • Sink operators are appended for output connections

The physical graph includes parallelism annotations, partitioning strategies (hash, round-robin, broadcast), and serialization formats for inter-operator communication.
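A physical graph with those annotations might be modeled as below. The node and edge shapes are hypothetical, but they illustrate the key decision: the shuffle into a stateful operator is hash-partitioned on the group key so that all rows for a key land on the same worker, while stateless hops can use round-robin.

```python
from dataclasses import dataclass

@dataclass
class PhysicalNode:
    name: str
    parallelism: int

@dataclass
class Edge:
    src: str
    dst: str
    partitioning: str  # "hash", "round_robin", or "broadcast"

# Illustrative physical DAG for a windowed aggregation pipeline.
nodes = [
    PhysicalNode("kafka_source", 4),
    PhysicalNode("filter", 4),
    PhysicalNode("tumbling_agg", 8),
    PhysicalNode("kafka_sink", 2),
]
edges = [
    Edge("kafka_source", "filter", "round_robin"),  # stateless: any worker will do
    Edge("filter", "tumbling_agg", "hash"),         # co-locate rows by group key
    Edge("tumbling_agg", "kafka_sink", "round_robin"),
]

into_agg = [e for e in edges if e.dst == "tumbling_agg"]
print(into_agg[0].partitioning)  # prints hash
```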

Usage

SQL compilation is invoked in the following scenarios:

  • Pipeline creation: When a user submits a new SQL pipeline through the API or web console, the query is compiled into a dataflow program before being persisted and scheduled.
  • Pipeline updates: When modifying an existing pipeline's SQL query, recompilation produces a new dataflow graph that replaces the previous version.
  • Query validation: A lightweight form of compilation is used during validation to generate a pipeline graph preview without producing a fully executable program.
  • Preview mode: Compilation with sinks disabled allows users to preview query results without writing to external systems.

Theoretical Basis

Relational Algebra

SQL compilation is grounded in relational algebra, which provides a formal framework for expressing data transformations. The core operations -- selection (σ), projection (π), join (⋈), aggregation, and set operations -- form a closed algebra over relations. Each SQL query can be expressed as a composition of these operations, and the algebraic properties (commutativity, associativity, distributivity) enable systematic optimization.
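Closure is what makes composition work: every operation takes relations and returns a relation, so results feed directly into further operations. A minimal sketch over relations modeled as lists of dicts:

```python
# Relations as lists of dicts; each operation returns a relation,
# so operations compose freely (the algebra is closed).
def select(rel, pred):        # sigma
    return [row for row in rel if pred(row)]

def project(rel, cols):       # pi
    return [{c: row[c] for c in cols} for row in rel]

events = [
    {"user_id": 1, "kind": "click", "url": "/a"},
    {"user_id": 2, "kind": "view",  "url": "/b"},
]

# pi_{user_id}(sigma_{kind='click'}(events))
result = project(select(events, lambda r: r["kind"] == "click"), ["user_id"])
print(result)  # [{'user_id': 1}]
```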

The Dataflow Model

Arroyo's compilation target follows the Dataflow Model (Akidau et al., 2015), which unifies batch and streaming computation. Key concepts from this model that influence compilation include:

  • Event time vs. processing time: The compiler must track which time domain each operation uses, ensuring correct watermark propagation.
  • Windowing: The compiler translates SQL window specifications into windowing strategies (fixed, sliding, session) that determine how unbounded data is grouped for aggregation.
  • Triggers: The compiler determines when intermediate results are emitted based on watermark progress and trigger policies.
  • Accumulation: The compiler configures whether results accumulate, retract, or use accumulating-and-retracting semantics.
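The interaction of event time, windowing, and watermark-driven triggers can be sketched in a few lines: events are bucketed into tumbling windows by event time, and a window fires only once the watermark passes its end. This is a toy single-key model, not Arroyo's operator code.

```python
from collections import defaultdict

WINDOW = 60  # tumbling window width in seconds of event time

windows = defaultdict(int)  # window start -> event count
emitted = []                # (window start, count) pairs, in firing order

def on_event(event_time):
    # Assign the event to its tumbling window by event time.
    windows[(event_time // WINDOW) * WINDOW] += 1

def on_watermark(watermark):
    # Trigger: a window [start, start + WINDOW) fires once the
    # watermark guarantees no more events for it will arrive.
    for start in sorted(windows):
        if start + WINDOW <= watermark:
            emitted.append((start, windows.pop(start)))

for t in [5, 30, 61, 62]:
    on_event(t)
on_watermark(60)   # closes [0, 60)
on_watermark(125)  # closes [60, 120)
print(emitted)     # [(0, 2), (60, 2)]
```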

Query Optimization Theory

Streaming query optimization extends classical database optimization with additional cost dimensions:

  • State size: Operators that maintain state (joins, aggregations) incur memory costs proportional to their window sizes. The optimizer considers state size when choosing execution strategies.
  • Latency: The optimizer balances throughput against end-to-end latency, considering operator fusion and parallelism tradeoffs.
  • Network cost: In a distributed setting, the optimizer minimizes data shuffling by choosing partitioning strategies that co-locate related data.
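The state-size dimension lends itself to a back-of-envelope model: a windowed stream-stream join must buffer, on each side, the rows that arrived within the join window. The formula and the example rates below are illustrative assumptions, not measured Arroyo figures.

```python
def join_state_bytes(rate_left, rate_right, window_secs, row_bytes):
    """Rough upper bound on buffered join state: every row arriving on
    either side within the window is retained until the window expires."""
    return (rate_left + rate_right) * window_secs * row_bytes

# 10k + 2k events/sec joined over a 5-minute window at ~200 bytes/row
estimate = join_state_bytes(10_000, 2_000, 300, 200)
print(f"{estimate / 1e9:.1f} GB")  # prints 0.7 GB
```

An optimizer armed with such an estimate might, for example, prefer a join order that keeps the high-rate stream out of the largest window.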

Physical Plan Generation

The translation from logical to physical plan assigns concrete implementations to each operator, determines parallelism levels, and establishes data partitioning. This phase resolves abstract operations into specific algorithms -- for example, choosing between hash-based and sort-based aggregation, or between broadcast and hash-partitioned joins.
