Implementation: ArroyoSystems Arroyo Compile SQL
| Field | Value |
|---|---|
| Sources | ArroyoSystems/arroyo |
| Domains | Stream_Processing, SQL, Query_Planning |
| Last Updated | 2026-02-08 |
Overview
compile_sql and parse_and_get_program are the core functions responsible for compiling a SQL query into an executable streaming dataflow program in the Arroyo engine. compile_sql orchestrates the full compilation pipeline from the API layer -- resolving connections, registering UDFs, and invoking the planner -- while parse_and_get_program performs the actual SQL parsing, planning, and program generation within the planner crate.
Description
The SQL compilation process involves two key functions that work together across crate boundaries:
compile_sql (API Layer)
The compile_sql function lives in the API crate and serves as the entry point for SQL compilation from HTTP request handlers. It performs the following steps:
- Schema provider construction: Builds an `ArroyoSchemaProvider` populated with the user's registered connections (Kafka topics, Kinesis streams, file sources, etc.) by querying the database.
- UDF registration: Compiles and registers any user-defined functions provided alongside the query. UDFs are compiled as Rust functions and made available to the SQL planner.
- Configuration assembly: Constructs a `SqlConfig` with the requested parallelism level and other execution parameters.
- Delegation to planner: Calls `parse_and_get_program` with the query, schema provider, and configuration.
- Result packaging: Wraps the resulting `CompiledSql` with connection IDs and UDF IDs for persistence.
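The five steps above can be sketched as a single orchestration function. This is a simplified illustration with stub types standing in for Arroyo's real ones (`ArroyoSchemaProvider`, `SqlConfig`, `CompiledSql`); async, database access, and the real planner are all omitted, and every field and helper name here is an assumption, not Arroyo's actual layout.

```rust
// Stub types standing in for Arroyo's real structs (field names are assumptions).
struct Udf { definition: String }
struct SchemaProvider { tables: Vec<String>, udfs: Vec<String> }
struct SqlConfig { default_parallelism: u64 }
struct CompiledSql { plan: String, connection_ids: Vec<i64>, udf_ids: Vec<i64> }

// Step 1: build a schema provider from the user's registered connections.
fn build_schema_provider(connections: &[&str]) -> SchemaProvider {
    SchemaProvider {
        tables: connections.iter().map(|c| c.to_string()).collect(),
        udfs: vec![],
    }
}

// Step 2: register UDFs so the planner can resolve them during planning.
fn register_udfs(provider: &mut SchemaProvider, udfs: &[Udf]) {
    for udf in udfs {
        provider.udfs.push(udf.definition.clone());
    }
}

// Step 4 (stub): the planner turns query + schemas + config into a program.
fn parse_and_get_program(
    query: &str,
    provider: &SchemaProvider,
    config: &SqlConfig,
) -> Result<CompiledSql, String> {
    if query.trim().is_empty() {
        return Err("empty query".to_string());
    }
    Ok(CompiledSql {
        plan: format!("plan({query}) x{}", config.default_parallelism),
        connection_ids: (0..provider.tables.len() as i64).collect(),
        udf_ids: (0..provider.udfs.len() as i64).collect(),
    })
}

// compile_sql glues the steps together; steps 3 and 5 are inline.
fn compile_sql(
    query: &str,
    udfs: &[Udf],
    parallelism: u64,
    connections: &[&str],
) -> Result<CompiledSql, String> {
    let mut provider = build_schema_provider(connections);       // step 1
    register_udfs(&mut provider, udfs);                          // step 2
    let config = SqlConfig { default_parallelism: parallelism }; // step 3
    parse_and_get_program(query, &provider, &config)             // steps 4-5
}
```

The point of the split is that `compile_sql` owns all environment-dependent work (database lookups, auth, UDF compilation) while the planner stays a pure function of query, schemas, and config.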
parse_and_get_program (Planner Layer)
The parse_and_get_program function lives in the planner crate and handles the core compilation work:
- SQL parsing: Parses the SQL text into an AST using DataFusion's SQL parser.
- Logical planning: Converts the AST into a logical plan, resolving table references against the schema provider and type-checking all expressions.
- Streaming optimization: Applies streaming-specific optimization passes including predicate pushdown, projection pruning, and window optimization.
- Program generation: Translates the optimized logical plan into a `LogicalProgram` -- the dataflow graph representation consisting of nodes (operators) and edges (data channels) with associated schemas, parallelism, and partitioning metadata.
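To make the final step concrete, here is a minimal sketch of what a dataflow-graph representation like `LogicalProgram` looks like: operator nodes connected by schema-carrying edges. The struct and field names (`Node`, `Edge`, `parallelism`, `schema`) are illustrative assumptions, not Arroyo's actual definitions.

```rust
// A toy dataflow graph: nodes are operators, edges are data channels.
#[derive(Debug)]
struct Node { id: usize, operator: String, parallelism: u64 }

#[derive(Debug)]
struct Edge { from: usize, to: usize, schema: Vec<String> }

#[derive(Debug, Default)]
struct Program { nodes: Vec<Node>, edges: Vec<Edge> }

impl Program {
    // Append an operator node and return its id.
    fn add_node(&mut self, operator: &str, parallelism: u64) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { id, operator: operator.to_string(), parallelism });
        id
    }

    // Connect two nodes with a channel carrying the given record schema.
    fn connect(&mut self, from: usize, to: usize, schema: &[&str]) {
        self.edges.push(Edge {
            from,
            to,
            schema: schema.iter().map(|s| s.to_string()).collect(),
        });
    }
}

// A windowed GROUP BY query might lower to: source -> window aggregate -> sink.
fn build_example_program() -> Program {
    let mut program = Program::default();
    let source = program.add_node("kafka_source(clicks)", 4);
    let agg = program.add_node("tumbling_window_count", 4);
    let sink = program.add_node("sink", 1);
    program.connect(source, agg, &["user_id", "event_time"]);
    program.connect(agg, sink, &["user_id", "cnt"]);
    program
}
```

Keeping per-node parallelism and per-edge schemas in the graph is what lets the engine later decide partitioning (e.g. hash by `user_id` into the aggregate) without re-consulting the SQL planner.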
Usage
These functions are used when creating or updating streaming pipelines. compile_sql is called by API endpoint handlers (create_pipeline, validate_query), while parse_and_get_program is available for direct use in testing or programmatic pipeline construction.
Code Reference
Source Location
- Repository: `ArroyoSystems/arroyo` (GitHub)

compile_sql:
- File: `crates/arroyo-api/src/pipelines.rs`
- Lines: L61--L186

parse_and_get_program:
- File: `crates/arroyo-planner/src/lib.rs`
- Lines: L551--L562
Signature
```rust
async fn compile_sql(
    query: &str,
    udfs: &Vec<Udf>,
    parallelism: u64,
    auth_data: &AuthData,
    db: &DatabaseSource,
) -> Result<CompiledSql, ErrorResp>

pub async fn parse_and_get_program(
    query: &str,
    schema_provider: ArroyoSchemaProvider,
    config: SqlConfig,
) -> Result<CompiledSql>
```
Import
```rust
// compile_sql is crate-private to arroyo-api
use arroyo_api::pipelines::compile_sql; // (internal usage only)

// parse_and_get_program is publicly exported from the planner crate
use arroyo_planner::parse_and_get_program;
```
I/O Contract
Inputs (compile_sql)
| Name | Type | Required | Description |
|---|---|---|---|
| `query` | `&str` | Yes | The SQL query text to compile |
| `udfs` | `&Vec<Udf>` | Yes | User-defined function definitions to register before compilation |
| `parallelism` | `u64` | Yes | Default parallelism level for operators in the dataflow graph |
| `auth_data` | `&AuthData` | Yes | Authentication context identifying the user and organization |
| `db` | `&DatabaseSource` | Yes | Database connection for resolving registered connections and schemas |
Inputs (parse_and_get_program)
| Name | Type | Required | Description |
|---|---|---|---|
| `query` | `&str` | Yes | The SQL query text to parse and compile |
| `schema_provider` | `ArroyoSchemaProvider` | Yes | Pre-populated schema provider with registered tables, connections, and UDFs |
| `config` | `SqlConfig` | Yes | SQL compilation configuration including parallelism and optimization settings |
Outputs
| Field | Type | Description |
|---|---|---|
| `program` | `LogicalProgram` | The compiled dataflow graph containing operator nodes, edges, and execution metadata |
| `connection_ids` | `Vec<i64>` | Database IDs of all connections referenced by the compiled program |
| `udf_ids` | `Vec<i64>` | Database IDs of all UDFs used in the compiled program |
On failure, `compile_sql` returns an `ErrorResp` carrying an HTTP status and error details, while `parse_and_get_program` returns an `anyhow::Error` with a descriptive message.
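The difference in error shapes can be sketched as follows: the planner fails with a plain descriptive error, and the API layer maps it into an HTTP-style response. `ErrorResp`'s real fields in Arroyo may differ; `status_code` and `message` here are assumptions, and a `String` error stands in for `anyhow::Error`.

```rust
// HTTP-style error response as the API layer might shape it (fields assumed).
#[derive(Debug, PartialEq)]
struct ErrorResp { status_code: u16, message: String }

// Stand-in for the planner failing with a descriptive message
// (anyhow::Error in the real crate).
fn plan(query: &str) -> Result<String, String> {
    if query.contains("FROM") {
        Ok(format!("plan({query})"))
    } else {
        Err(format!("error planning query: no FROM clause in '{query}'"))
    }
}

// The API layer converts any planner error into a 400 Bad Request response,
// so HTTP handlers never see raw planner errors.
fn compile(query: &str) -> Result<String, ErrorResp> {
    plan(query).map_err(|message| ErrorResp { status_code: 400, message })
}
```

This layering keeps the planner crate free of HTTP concerns while still giving API clients structured, status-coded errors.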
Usage Examples
Internal API Usage
```rust
// Within an API handler, compile a user's SQL query:
let compiled = compile_sql(
    "SELECT user_id, count(*) as cnt
     FROM clicks
     GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE)",
    &udfs,
    4, // parallelism
    &auth_data,
    &db,
).await?;

// The compiled result contains the dataflow program
let program: LogicalProgram = compiled.program;
let connection_ids: Vec<i64> = compiled.connection_ids;
```
Direct Planner Usage
```rust
use arroyo_planner::parse_and_get_program;
use arroyo_planner::ArroyoSchemaProvider;
use arroyo_planner::SqlConfig;

let schema_provider = ArroyoSchemaProvider::new();
// ... register tables and UDFs on schema_provider ...

let config = SqlConfig {
    default_parallelism: 4,
    ..Default::default()
};

let compiled = parse_and_get_program(
    "SELECT * FROM source WHERE value > 100",
    schema_provider,
    config,
).await?;
```