
Implementation:ArroyoSystems Arroyo Compile Sql

From Leeroopedia


Field Value
Sources ArroyoSystems/arroyo
Domains Stream_Processing, SQL, Query_Planning
Last Updated 2026-02-08

Overview

compile_sql and parse_and_get_program are the core functions responsible for compiling a SQL query into an executable streaming dataflow program in the Arroyo engine. compile_sql orchestrates the full compilation pipeline from the API layer -- resolving connections, registering UDFs, and invoking the planner -- while parse_and_get_program performs the actual SQL parsing, planning, and program generation within the planner crate.

Description

The SQL compilation process involves two key functions that work together across crate boundaries:

compile_sql (API Layer)

The compile_sql function lives in the API crate and serves as the entry point for SQL compilation from HTTP request handlers. It performs the following steps:

  1. Schema provider construction: Builds an ArroyoSchemaProvider populated with the user's registered connections (Kafka topics, Kinesis streams, file sources, etc.) by querying the database.
  2. UDF registration: Compiles and registers any user-defined functions provided alongside the query. UDFs are compiled as Rust functions and made available to the SQL planner.
  3. Configuration assembly: Constructs a SqlConfig with the requested parallelism level and other execution parameters.
  4. Delegation to planner: Calls parse_and_get_program with the query, schema provider, and configuration.
  5. Result packaging: Wraps the resulting CompiledSql with connection IDs and UDF IDs for persistence.
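The five steps above can be sketched as follows. This is a simplified, hypothetical model for illustration only: the struct names (SchemaProviderSketch, ConfigSketch, CompiledSketch) and the stubbed planner call are stand-ins for Arroyo's real ArroyoSchemaProvider, SqlConfig, and CompiledSql types, not the actual implementation.

```rust
// Hedged sketch of compile_sql's orchestration. All types here are invented
// stand-ins, not Arroyo's real ArroyoSchemaProvider, SqlConfig, or CompiledSql.

#[derive(Default)]
struct SchemaProviderSketch {
    tables: Vec<String>,
    udfs: Vec<String>,
}

struct ConfigSketch {
    parallelism: u64,
}

struct CompiledSketch {
    program: String,
    connection_ids: Vec<i64>,
    udf_ids: Vec<i64>,
}

fn compile_sql_sketch(
    query: &str,
    udfs: &[String],
    parallelism: u64,
) -> Result<CompiledSketch, String> {
    // 1. Schema provider construction: in Arroyo this is populated from the
    //    database with the user's registered connections.
    let mut provider = SchemaProviderSketch::default();
    provider.tables.push("clicks".to_string()); // stand-in for a DB lookup

    // 2. UDF registration: make user-defined functions visible to the planner.
    provider.udfs.extend(udfs.iter().cloned());

    // 3. Configuration assembly.
    let config = ConfigSketch { parallelism };

    // 4. Delegation to the planner (stubbed here).
    let program = plan_sketch(query, &provider, &config)?;

    // 5. Result packaging: attach connection and UDF IDs for persistence.
    Ok(CompiledSketch {
        program,
        connection_ids: vec![1],
        udf_ids: (0..udfs.len() as i64).collect(),
    })
}

fn plan_sketch(
    query: &str,
    provider: &SchemaProviderSketch,
    config: &ConfigSketch,
) -> Result<String, String> {
    if query.trim().is_empty() {
        return Err("empty query".to_string());
    }
    Ok(format!(
        "plan(parallelism={}, tables={}, udfs={})",
        config.parallelism,
        provider.tables.len(),
        provider.udfs.len()
    ))
}
```

The key design point this models is separation of concerns: the API layer gathers context (connections, UDFs, config) and packages results, while all SQL-specific work is delegated to the planner.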

parse_and_get_program (Planner Layer)

The parse_and_get_program function lives in the planner crate and handles the core compilation work:

  1. SQL parsing: Parses the SQL text into an AST using DataFusion's SQL parser.
  2. Logical planning: Converts the AST into a logical plan, resolving table references against the schema provider and type-checking all expressions.
  3. Streaming optimization: Applies streaming-specific optimization passes including predicate pushdown, projection pruning, and window optimization.
  4. Program generation: Translates the optimized logical plan into a LogicalProgram -- the dataflow graph representation consisting of nodes (operators) and edges (data channels) with associated schemas, parallelism, and partitioning metadata.
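The four stages above can be pictured as a straight-line composition. The sketch below is illustrative only: the string-based "plans" and the node/edge layout are assumptions, while the real pipeline uses DataFusion's parser and Arroyo's LogicalProgram graph.

```rust
// Hedged sketch of the four planner stages as a composition. The types are
// invented stand-ins, not DataFusion's AST or Arroyo's LogicalProgram.

struct AstSketch(String);
struct PlanSketch(String);

struct ProgramSketch {
    nodes: Vec<String>,         // operators
    edges: Vec<(usize, usize)>, // data channels between operators
}

// 1. SQL parsing: text -> AST (DataFusion's parser in the real code).
fn parse_sketch(query: &str) -> Result<AstSketch, String> {
    if query.trim().to_uppercase().starts_with("SELECT") {
        Ok(AstSketch(query.to_string()))
    } else {
        Err(format!("unsupported statement: {query}"))
    }
}

// 2. Logical planning: resolve table references and type-check expressions.
fn logical_plan_sketch(ast: AstSketch) -> PlanSketch {
    PlanSketch(format!("Projection <- Scan [{}]", ast.0))
}

// 3. Streaming optimization: predicate pushdown, pruning, window rewrites.
fn optimize_sketch(plan: PlanSketch) -> PlanSketch {
    PlanSketch(format!("optimized({})", plan.0))
}

// 4. Program generation: optimized plan -> dataflow graph of nodes and edges.
fn generate_sketch(plan: PlanSketch) -> ProgramSketch {
    ProgramSketch {
        nodes: vec![
            "source".to_string(),
            format!("compute: {}", plan.0),
            "sink".to_string(),
        ],
        edges: vec![(0, 1), (1, 2)],
    }
}

fn parse_and_get_program_sketch(query: &str) -> Result<ProgramSketch, String> {
    let ast = parse_sketch(query)?;
    let plan = optimize_sketch(logical_plan_sketch(ast));
    Ok(generate_sketch(plan))
}
```

Only the first stage can fail on malformed input in this sketch; in the real planner, each stage can reject the query (unknown tables, type errors, unsupported streaming constructs).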

Usage

These functions are used when creating or updating streaming pipelines. compile_sql is called by API endpoint handlers (create_pipeline, validate_query), while parse_and_get_program is available for direct use in testing or programmatic pipeline construction.

Code Reference

Source Location

Repository
ArroyoSystems/arroyo (GitHub)

compile_sql:

File
crates/arroyo-api/src/pipelines.rs
Lines
L61--L186

parse_and_get_program:

File
crates/arroyo-planner/src/lib.rs
Lines
L551--L562

Signature

async fn compile_sql(
    query: &str,
    udfs: &Vec<Udf>,
    parallelism: u64,
    auth_data: &AuthData,
    db: &DatabaseSource,
) -> Result<CompiledSql, ErrorResp>
pub async fn parse_and_get_program(
    query: &str,
    schema_provider: ArroyoSchemaProvider,
    config: SqlConfig,
) -> Result<CompiledSql>

Import

// compile_sql is crate-private to arroyo-api and cannot be imported from
// outside that crate; within arroyo-api it is referenced as:
use crate::pipelines::compile_sql;  // (internal usage only)

// parse_and_get_program is publicly exported from the planner crate
use arroyo_planner::parse_and_get_program;

I/O Contract

Inputs (compile_sql)

Name Type Required Description
query &str Yes The SQL query text to compile
udfs &Vec<Udf> Yes User-defined function definitions to register before compilation
parallelism u64 Yes Default parallelism level for operators in the dataflow graph
auth_data &AuthData Yes Authentication context identifying the user and organization
db &DatabaseSource Yes Database connection for resolving registered connections and schemas

Inputs (parse_and_get_program)

Name Type Required Description
query &str Yes The SQL query text to parse and compile
schema_provider ArroyoSchemaProvider Yes Pre-populated schema provider with registered tables, connections, and UDFs
config SqlConfig Yes SQL compilation configuration including parallelism and optimization settings

Outputs

Field Type Description
program LogicalProgram The compiled dataflow graph containing operator nodes, edges, and execution metadata
connection_ids Vec<i64> Database IDs of all connections referenced by the compiled program
udf_ids Vec<i64> Database IDs of all UDFs used in the compiled program

On failure, compile_sql returns an ErrorResp with HTTP status and error details, while parse_and_get_program returns an anyhow::Error with a descriptive message.
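This error boundary can be sketched as the API layer converting a planner failure into an HTTP-style response. The ErrorRespSketch fields and the 400 mapping below are assumptions for illustration, not the API crate's exact ErrorResp definition.

```rust
// Hedged sketch of the error contract between the two layers. ErrorRespSketch
// is an invented stand-in, not the real ErrorResp type from arroyo-api.

#[derive(Debug, PartialEq)]
struct ErrorRespSketch {
    status: u16,
    message: String,
}

fn planner_error_to_resp(err: &str) -> ErrorRespSketch {
    // A user's invalid SQL is a client error, so map it to 400 Bad Request
    // and surface the planner's message for debugging.
    ErrorRespSketch {
        status: 400,
        message: format!("SQL compilation failed: {err}"),
    }
}
```

Keeping the planner's error type generic (anyhow::Error) while the API layer owns the HTTP mapping lets the planner be reused outside the HTTP context, e.g. in tests.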

Usage Examples

Internal API Usage

// Within an API handler, compile a user's SQL query:
let compiled = compile_sql(
    "SELECT user_id, count(*) as cnt
     FROM clicks
     GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE)",
    &udfs,
    4,          // parallelism
    &auth_data,
    &db,
).await?;

// The compiled result contains the dataflow program
let program: LogicalProgram = compiled.program;
let connection_ids: Vec<i64> = compiled.connection_ids;

Direct Planner Usage

use arroyo_planner::parse_and_get_program;
use arroyo_planner::{ArroyoSchemaProvider, SqlConfig};

let schema_provider = ArroyoSchemaProvider::new();
// ... register tables and UDFs on schema_provider ...

let config = SqlConfig {
    default_parallelism: 4,
    ..Default::default()
};

let compiled = parse_and_get_program(
    "SELECT * FROM source WHERE value > 100",
    schema_provider,
    config,
).await?;
