Implementation: ArroyoSystems Arroyo Compile SQL
| Field | Value |
|---|---|
| Sources | ArroyoSystems/arroyo |
| Domains | Stream_Processing, SQL, Query_Planning |
| Last Updated | 2026-02-08 |
Overview
compile_sql and parse_and_get_program are the core functions responsible for compiling a SQL query into an executable streaming dataflow program in the Arroyo engine. compile_sql orchestrates the full compilation pipeline from the API layer -- resolving connections, registering UDFs, and invoking the planner -- while parse_and_get_program performs the actual SQL parsing, planning, and program generation within the planner crate.
Description
The SQL compilation process involves two key functions that work together across crate boundaries:
compile_sql (API Layer)
The compile_sql function lives in the API crate and serves as the entry point for SQL compilation from HTTP request handlers. It performs the following steps:
- Schema provider construction: Builds an `ArroyoSchemaProvider` populated with the user's registered connections (Kafka topics, Kinesis streams, file sources, etc.) by querying the database.
- UDF registration: Compiles and registers any user-defined functions provided alongside the query. UDFs are compiled as Rust functions and made available to the SQL planner.
- Configuration assembly: Constructs a `SqlConfig` with the requested parallelism level and other execution parameters.
- Delegation to planner: Calls `parse_and_get_program` with the query, schema provider, and configuration.
- Result packaging: Wraps the resulting `CompiledSql` with connection IDs and UDF IDs for persistence.
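The five steps above can be sketched as a single orchestration function. This is a simplified illustration with stub types standing in for Arroyo's real ones (`ArroyoSchemaProvider`, `SqlConfig`, `CompiledSql`); async, database access, and the real planner are all omitted, and every field and helper name here is an assumption, not Arroyo's actual layout.

```rust
// Stub types standing in for Arroyo's real structs (field names are assumptions).
struct Udf { definition: String }
struct SchemaProvider { tables: Vec<String>, udfs: Vec<String> }
struct SqlConfig { default_parallelism: u64 }
struct CompiledSql { plan: String, connection_ids: Vec<i64>, udf_ids: Vec<i64> }

// Step 1: build a schema provider from the user's registered connections.
fn build_schema_provider(connections: &[&str]) -> SchemaProvider {
    SchemaProvider {
        tables: connections.iter().map(|c| c.to_string()).collect(),
        udfs: vec![],
    }
}

// Step 2: register UDFs so the planner can resolve them during planning.
fn register_udfs(provider: &mut SchemaProvider, udfs: &[Udf]) {
    for udf in udfs {
        provider.udfs.push(udf.definition.clone());
    }
}

// Step 4 (stub): the planner turns query + schemas + config into a program.
fn parse_and_get_program(
    query: &str,
    provider: &SchemaProvider,
    config: &SqlConfig,
) -> Result<CompiledSql, String> {
    if query.trim().is_empty() {
        return Err("empty query".to_string());
    }
    Ok(CompiledSql {
        plan: format!("plan({query}) x{}", config.default_parallelism),
        connection_ids: (0..provider.tables.len() as i64).collect(),
        udf_ids: (0..provider.udfs.len() as i64).collect(),
    })
}

// compile_sql glues the steps together; steps 3 and 5 are inline.
fn compile_sql(
    query: &str,
    udfs: &[Udf],
    parallelism: u64,
    connections: &[&str],
) -> Result<CompiledSql, String> {
    let mut provider = build_schema_provider(connections);       // step 1
    register_udfs(&mut provider, udfs);                          // step 2
    let config = SqlConfig { default_parallelism: parallelism }; // step 3
    parse_and_get_program(query, &provider, &config)             // steps 4-5
}
```

The point of the split is that `compile_sql` owns all environment-dependent work (database lookups, auth, UDF compilation) while the planner stays a pure function of query, schemas, and config.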
parse_and_get_program (Planner Layer)
The parse_and_get_program function lives in the planner crate and handles the core compilation work:
- SQL parsing: Parses the SQL text into an AST using DataFusion's SQL parser.
- Logical planning: Converts the AST into a logical plan, resolving table references against the schema provider and type-checking all expressions.
- Streaming optimization: Applies streaming-specific optimization passes including predicate pushdown, projection pruning, and window optimization.
- Program generation: Translates the optimized logical plan into a `LogicalProgram` -- the dataflow graph representation consisting of nodes (operators) and edges (data channels) with associated schemas, parallelism, and partitioning metadata.
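To make the final step concrete, here is a minimal sketch of what a dataflow-graph representation like `LogicalProgram` looks like: operator nodes connected by schema-carrying edges. The struct and field names (`Node`, `Edge`, `parallelism`, `schema`) are illustrative assumptions, not Arroyo's actual definitions.

```rust
// A toy dataflow graph: nodes are operators, edges are data channels.
#[derive(Debug)]
struct Node { id: usize, operator: String, parallelism: u64 }

#[derive(Debug)]
struct Edge { from: usize, to: usize, schema: Vec<String> }

#[derive(Debug, Default)]
struct Program { nodes: Vec<Node>, edges: Vec<Edge> }

impl Program {
    // Append an operator node and return its id.
    fn add_node(&mut self, operator: &str, parallelism: u64) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { id, operator: operator.to_string(), parallelism });
        id
    }

    // Connect two nodes with a channel carrying the given record schema.
    fn connect(&mut self, from: usize, to: usize, schema: &[&str]) {
        self.edges.push(Edge {
            from,
            to,
            schema: schema.iter().map(|s| s.to_string()).collect(),
        });
    }
}

// A windowed GROUP BY query might lower to: source -> window aggregate -> sink.
fn build_example_program() -> Program {
    let mut program = Program::default();
    let source = program.add_node("kafka_source(clicks)", 4);
    let agg = program.add_node("tumbling_window_count", 4);
    let sink = program.add_node("sink", 1);
    program.connect(source, agg, &["user_id", "event_time"]);
    program.connect(agg, sink, &["user_id", "cnt"]);
    program
}
```

Keeping per-node parallelism and per-edge schemas in the graph is what lets the engine later decide partitioning (e.g. hash by `user_id` into the aggregate) without re-consulting the SQL planner.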
Usage
These functions are used when creating or updating streaming pipelines. compile_sql is called by API endpoint handlers (create_pipeline, validate_query), while parse_and_get_program is available for direct use in testing or programmatic pipeline construction.
Code Reference
Source Location
- Repository: `ArroyoSystems/arroyo` (GitHub)

compile_sql:
- File: `crates/arroyo-api/src/pipelines.rs`
- Lines: L61--L186

parse_and_get_program:
- File: `crates/arroyo-planner/src/lib.rs`
- Lines: L551--L562
Signature
```rust
async fn compile_sql(
    query: &str,
    udfs: &Vec<Udf>,
    parallelism: u64,
    auth_data: &AuthData,
    db: &DatabaseSource,
) -> Result<CompiledSql, ErrorResp>

pub async fn parse_and_get_program(
    query: &str,
    schema_provider: ArroyoSchemaProvider,
    config: SqlConfig,
) -> Result<CompiledSql>
```
Import
```rust
// compile_sql is crate-private to arroyo-api
use arroyo_api::pipelines::compile_sql; // (internal usage only)

// parse_and_get_program is publicly exported from the planner crate
use arroyo_planner::parse_and_get_program;
```
I/O Contract
Inputs (compile_sql)
| Name | Type | Required | Description |
|---|---|---|---|
| `query` | `&str` | Yes | The SQL query text to compile |
| `udfs` | `&Vec<Udf>` | Yes | User-defined function definitions to register before compilation |
| `parallelism` | `u64` | Yes | Default parallelism level for operators in the dataflow graph |
| `auth_data` | `&AuthData` | Yes | Authentication context identifying the user and organization |
| `db` | `&DatabaseSource` | Yes | Database connection for resolving registered connections and schemas |
Inputs (parse_and_get_program)
| Name | Type | Required | Description |
|---|---|---|---|
| `query` | `&str` | Yes | The SQL query text to parse and compile |
| `schema_provider` | `ArroyoSchemaProvider` | Yes | Pre-populated schema provider with registered tables, connections, and UDFs |
| `config` | `SqlConfig` | Yes | SQL compilation configuration including parallelism and optimization settings |
Outputs
| Field | Type | Description |
|---|---|---|
| `program` | `LogicalProgram` | The compiled dataflow graph containing operator nodes, edges, and execution metadata |
| `connection_ids` | `Vec<i64>` | Database IDs of all connections referenced by the compiled program |
| `udf_ids` | `Vec<i64>` | Database IDs of all UDFs used in the compiled program |
On failure, `compile_sql` returns an `ErrorResp` carrying an HTTP status and error details, while `parse_and_get_program` returns an `anyhow::Error` with a descriptive message.
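The difference in error shapes can be sketched as follows: the planner fails with a plain descriptive error, and the API layer maps it into an HTTP-style response. `ErrorResp`'s real fields in Arroyo may differ; `status_code` and `message` here are assumptions, and a `String` error stands in for `anyhow::Error`.

```rust
// HTTP-style error response as the API layer might shape it (fields assumed).
#[derive(Debug, PartialEq)]
struct ErrorResp { status_code: u16, message: String }

// Stand-in for the planner failing with a descriptive message
// (anyhow::Error in the real crate).
fn plan(query: &str) -> Result<String, String> {
    if query.contains("FROM") {
        Ok(format!("plan({query})"))
    } else {
        Err(format!("error planning query: no FROM clause in '{query}'"))
    }
}

// The API layer converts any planner error into a 400 Bad Request response,
// so HTTP handlers never see raw planner errors.
fn compile(query: &str) -> Result<String, ErrorResp> {
    plan(query).map_err(|message| ErrorResp { status_code: 400, message })
}
```

This layering keeps the planner crate free of HTTP concerns while still giving API clients structured, status-coded errors.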
Usage Examples
Internal API Usage
```rust
// Within an API handler, compile a user's SQL query:
let compiled = compile_sql(
    "SELECT user_id, count(*) as cnt
     FROM clicks
     GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE)",
    &udfs,
    4, // parallelism
    &auth_data,
    &db,
).await?;

// The compiled result contains the dataflow program
let program: LogicalProgram = compiled.program;
let connection_ids: Vec<i64> = compiled.connection_ids;
```
Direct Planner Usage
```rust
use arroyo_planner::parse_and_get_program;
use arroyo_planner::ArroyoSchemaProvider;
use arroyo_planner::SqlConfig;

let schema_provider = ArroyoSchemaProvider::new();
// ... register tables and UDFs on schema_provider ...

let config = SqlConfig {
    default_parallelism: 4,
    ..Default::default()
};

let compiled = parse_and_get_program(
    "SELECT * FROM source WHERE value > 100",
    schema_provider,
    config,
).await?;
```