Principle:ArroyoSystems Arroyo SQL Compilation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Repo (ArroyoSystems/arroyo), Doc (Arroyo Documentation), Paper (The Dataflow Model) |
| Domains | Stream_Processing, SQL, Query_Planning |
| Last Updated | 2026-02-08 |
Overview
Compiling SQL queries into executable streaming dataflow programs. This principle covers the full compilation pipeline: SQL parsing, logical plan generation, streaming plan optimization, and physical dataflow graph creation. The result is a directed acyclic graph (DAG) of operators that can be distributed across a cluster for parallel stream processing.
Description
SQL compilation in a streaming engine transforms a declarative SQL query into an imperative, distributed execution plan. Unlike batch SQL compilation, streaming SQL compilation must account for unbounded inputs, continuous execution, watermark propagation, and windowed state management. Arroyo's compilation pipeline proceeds through several well-defined stages.
Stage 1: SQL Parsing
The raw SQL text is parsed into an Abstract Syntax Tree (AST). Arroyo leverages Apache DataFusion's SQL parser, which supports a broad SQL dialect including streaming extensions. The parser handles:
- Standard DML statements (`SELECT`, `INSERT INTO`)
- Table expressions (`FROM`, `JOIN`, subqueries)
- Window specifications (`TUMBLE`, `HOP`, `SESSION`)
- Common table expressions (`WITH` clauses)
- User-defined function invocations
Stage 2: Logical Plan Generation
The AST is converted into a logical plan using relational algebra operators. Each SQL construct maps to one or more logical operators:
- `SELECT` columns map to Projection operators
- `WHERE` clauses map to Selection (Filter) operators
- `JOIN` clauses map to Join operators with the appropriate join type (inner, left, right, full)
- `GROUP BY` clauses map to Aggregation operators
- Window functions map to WindowAggregation operators with associated window specifications
- Source table references map to Scan operators bound to registered connections
The logical plan is a tree of these operators, representing the query's semantics without specifying execution details.
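The operator tree can be sketched with a few toy node types. This is an illustrative Python sketch only; Arroyo's actual logical plans are Rust structures built on DataFusion, and the class and field names here are hypothetical.

```python
from dataclasses import dataclass

# Toy logical-plan nodes (illustrative, not Arroyo's actual types).
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    input: object

@dataclass
class Aggregate:
    group_by: list
    aggs: list
    input: object

@dataclass
class Projection:
    columns: list
    input: object

# SELECT user_id, COUNT(*) FROM events WHERE status = 'ok' GROUP BY user_id
# maps bottom-up to Scan -> Filter -> Aggregate -> Projection:
plan = Projection(
    columns=["user_id", "count"],
    input=Aggregate(
        group_by=["user_id"],
        aggs=["COUNT(*)"],
        input=Filter(
            predicate="status = 'ok'",
            input=Scan(table="events"),
        ),
    ),
)
```

Reading the tree from the leaves up mirrors how data flows at runtime: rows are scanned, filtered, grouped, and finally projected.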
Stage 3: Streaming Plan Optimization
The logical plan undergoes optimization passes specific to streaming execution:
- Predicate pushdown: Filter conditions are pushed as close to sources as possible, reducing the volume of data flowing through the pipeline.
- Projection pruning: Unused columns are eliminated early, reducing serialization and memory costs.
- Operator fusion: Adjacent compatible operators (e.g., consecutive filters or projections) are merged into single operators to reduce scheduling overhead.
- Window optimization: Window specifications are analyzed to determine optimal state management strategies and watermark propagation rules.
- Join reordering: For multi-way joins, the optimizer may reorder join operations to minimize intermediate state size.
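Predicate pushdown, the first pass above, can be sketched as a recursive tree rewrite. This is a simplified illustration, not Arroyo's optimizer (which runs DataFusion's optimizer rules in Rust); it assumes every predicate only references columns that survive the projection, so the swap is always legal.

```python
from dataclasses import dataclass

# Toy plan nodes for illustration only (names are hypothetical).
@dataclass
class Scan:
    table: str
    pushed_filters: tuple = ()

@dataclass
class Filter:
    predicate: str
    input: object

@dataclass
class Projection:
    columns: tuple
    input: object

def push_down_filters(node):
    """Move Filter nodes below Projections, folding them into the Scan."""
    if isinstance(node, Filter):
        child = push_down_filters(node.input)
        if isinstance(child, Projection):
            # Filter(Projection(x)) -> Projection(Filter(x))
            return Projection(
                child.columns,
                push_down_filters(Filter(node.predicate, child.input)),
            )
        if isinstance(child, Scan):
            # The source now evaluates the predicate, shrinking the stream early.
            return Scan(child.table, child.pushed_filters + (node.predicate,))
        return Filter(node.predicate, child)
    if isinstance(node, Projection):
        return Projection(node.columns, push_down_filters(node.input))
    return node

plan = Filter("status = 'ok'", Projection(("user_id", "status"), Scan("events")))
optimized = push_down_filters(plan)
# The filter now lives on the scan instead of above the projection.
```

The same traversal pattern (rewrite a node, recurse into its children, re-check the result) underlies most rule-based optimizer passes.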
Stage 4: Physical Dataflow Graph Creation
The optimized logical plan is translated into a physical dataflow graph -- a DAG of concrete operator implementations connected by typed data channels. Each logical operator is mapped to one or more physical operators:
- Logical scans become source operators bound to specific connector implementations (Kafka, Kinesis, filesystem)
- Logical aggregations become stateful operators with configurable state backends
- Logical joins become temporal join operators with appropriate windowing semantics
- Sink operators are appended for output connections
The physical graph includes parallelism annotations, partitioning strategies (hash, round-robin, broadcast), and serialization formats for inter-operator communication.
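The hash partitioning strategy mentioned above can be sketched in a few lines. This is a generic illustration of the technique, not Arroyo's Rust implementation; the function name and hash choice are assumptions.

```python
import hashlib

def partition(key: str, parallelism: int) -> int:
    """Route a record key to one of `parallelism` downstream subtasks.

    A stable hash guarantees that every upstream worker routes the
    same key to the same subtask, so keyed state (e.g., a per-user
    aggregate) never requires cross-task lookups.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % parallelism

# With enough distinct keys, records spread across all subtasks:
targets = {partition(f"user-{i}", 4) for i in range(1000)}
```

Round-robin partitioning, by contrast, spreads load evenly but gives no key locality, and broadcast sends every record to all subtasks; the planner picks per edge based on the downstream operator's needs.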
Usage
SQL compilation is invoked in the following scenarios:
- Pipeline creation: When a user submits a new SQL pipeline through the API or web console, the query is compiled into a dataflow program before being persisted and scheduled.
- Pipeline updates: When modifying an existing pipeline's SQL query, recompilation produces a new dataflow graph that replaces the previous version.
- Query validation: A lightweight form of compilation is used during validation to generate a pipeline graph preview without producing a fully executable program.
- Preview mode: Compilation with sinks disabled allows users to preview query results without writing to external systems.
Theoretical Basis
Relational Algebra
SQL compilation is grounded in relational algebra, which provides a formal framework for expressing data transformations. The core operations -- selection (σ), projection (π), join (⋈), aggregation, and set operations -- form a closed algebra over relations. Each SQL query can be expressed as a composition of these operations, and the algebraic properties (commutativity, associativity, distributivity) enable systematic optimization.
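For example, the predicate-pushdown rewrite from Stage 3 is justified by the identity that selection distributes over join when the predicate touches only one side:

```latex
\sigma_{p}(R \bowtie S) \;=\; \sigma_{p}(R) \bowtie S
\qquad \text{when } p \text{ references only attributes of } R
```

Because both sides denote the same relation, the optimizer may freely pick whichever form is cheaper to execute, which is what makes rule-based rewriting sound.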
The Dataflow Model
Arroyo's compilation target follows the Dataflow Model (Akidau et al., 2015), which unifies batch and streaming computation. Key concepts from this model that influence compilation include:
- Event time vs. processing time: The compiler must track which time domain each operation uses, ensuring correct watermark propagation.
- Windowing: The compiler translates SQL window specifications into windowing strategies (fixed, sliding, session) that determine how unbounded data is grouped for aggregation.
- Triggers: The compiler determines when intermediate results are emitted based on watermark progress and trigger policies.
- Accumulation: The compiler configures whether results accumulate, retract, or use accumulating-and-retracting semantics.
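Fixed (tumbling) window assignment and watermark-driven triggering, the first two concepts above, reduce to simple arithmetic. The sketch below is in the style of the Dataflow Model; function names are illustrative, not Arroyo's API, and timestamps are epoch milliseconds.

```python
def tumbling_window(event_time_ms: int, size_ms: int) -> tuple:
    """Return the [start, end) fixed window containing the event."""
    start = event_time_ms - (event_time_ms % size_ms)
    return (start, start + size_ms)

def window_is_complete(window: tuple, watermark_ms: int) -> bool:
    """A watermark at or past the window's end triggers emission."""
    return watermark_ms >= window[1]

# An event at t=12,345 ms falls in the 10s window [10000, 20000):
w = tumbling_window(12_345, 10_000)
```

Sliding (hopping) windows generalize this: an event belongs to every window whose `[start, start + size)` range covers its timestamp, one per hop interval.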
Query Optimization Theory
Streaming query optimization extends classical database optimization with additional cost dimensions:
- State size: Operators that maintain state (joins, aggregations) incur memory costs proportional to their window sizes. The optimizer considers state size when choosing execution strategies.
- Latency: The optimizer balances throughput against end-to-end latency, considering operator fusion and parallelism tradeoffs.
- Network cost: In a distributed setting, the optimizer minimizes data shuffling by choosing partitioning strategies that co-locate related data.
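A back-of-the-envelope calculation shows why state size dominates these tradeoffs for windowed joins. The formula and numbers below are illustrative assumptions, not Arroyo's cost model.

```python
def join_state_bytes(rate_left: int, rate_right: int,
                     window_secs: int, row_bytes: int) -> int:
    """Estimate buffered state for a windowed stream-stream join.

    Each side must retain window_secs worth of rows so that late-arriving
    rows on the other side can still find their matches.
    """
    return (rate_left + rate_right) * window_secs * row_bytes

# 10k + 5k events/sec joined over a 1-hour window at ~200 bytes/row:
estimate = join_state_bytes(10_000, 5_000, 3600, 200)  # 10.8 GB
```

Halving the window halves the state, which is why the optimizer treats window size as a first-class cost input rather than a mere semantic detail.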
Physical Plan Generation
The translation from logical to physical plan assigns concrete implementations to each operator, determines parallelism levels, and establishes data partitioning. This phase resolves abstract operations into specific algorithms -- for example, choosing between hash-based and sort-based aggregation, or between broadcast and hash-partitioned joins.
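A typical planner rule for the broadcast-vs-hash-partitioned join choice can be sketched as a size heuristic. The threshold and logic below are assumptions for illustration, not Arroyo's actual planner.

```python
BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024  # hypothetical 10 MB cutoff

def choose_join_strategy(left_bytes: int, right_bytes: int) -> str:
    """Pick a join distribution strategy from estimated input sizes."""
    smaller = min(left_bytes, right_bytes)
    if smaller <= BROADCAST_LIMIT_BYTES:
        # Ship the small side to every worker; the large side stays put,
        # avoiding a shuffle entirely.
        return "broadcast"
    # Otherwise hash-partition both sides on the join key so matching
    # rows land on the same worker.
    return "hash-partitioned"
```

The same shape of rule (compare a cost estimate against a threshold, pick an algorithm) also covers choices such as hash-based versus sort-based aggregation.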