Principle:ArroyoSystems Arroyo SQL Query Validation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Repo (ArroyoSystems/arroyo), Doc (Arroyo Documentation) |
| Domains | Stream_Processing, SQL |
| Last Updated | 2026-02-08 |
Overview
SQL query validation checks a query before a pipeline is created in a streaming SQL engine. It involves parsing the SQL text, checking schema references against registered sources and sinks, and validating user-defined function (UDF) usage. The result is structured: a pipeline graph preview on success, or descriptive error messages on failure.
Description
SQL query validation is a critical pre-execution step in Arroyo's streaming pipeline lifecycle. Before a SQL query can be compiled into a runnable dataflow program, it must pass through several validation stages to ensure correctness. The validation process serves as a fast feedback loop for users, catching errors without incurring the cost of full compilation and deployment.
The validation pipeline in a streaming SQL system proceeds through multiple phases:
Lexical Analysis
The raw SQL text is tokenized into a stream of lexemes (keywords, identifiers, literals, operators, punctuation). Lexical errors such as unterminated string literals or invalid characters are caught at this stage.
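As an illustration of this stage, the following is a minimal tokenizer sketch, not Arroyo's actual lexer. The `Token` and `LexError` types, the keyword set, and the token categories are all assumptions made for the example; it only shows how unterminated string literals and invalid characters can be reported with their position.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str  # KEYWORD, IDENT, NUMBER, STRING, OP, or PUNCT
    text: str
    pos: int   # character offset in the source text


class LexError(Exception):
    """Hypothetical error type carrying the offset of the bad input."""

    def __init__(self, message: str, pos: int):
        super().__init__(f"{message} at offset {pos}")
        self.pos = pos


# A deliberately tiny keyword set; a real SQL dialect has many more.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "AS", "AND", "OR"}

TOKEN_RE = re.compile(
    r"""
    (?P<WS>\s+)
  | (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<STRING>'(?:[^']|'')*')
  | (?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<OP><=|>=|<>|!=|[=<>+\-*/])
  | (?P<PUNCT>[(),.;])
    """,
    re.VERBOSE,
)


def tokenize(sql: str) -> list[Token]:
    tokens = []
    pos = 0
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            # A quote that starts no valid STRING token was never closed.
            if sql[pos] == "'":
                raise LexError("unterminated string literal", pos)
            raise LexError(f"invalid character {sql[pos]!r}", pos)
        kind = match.lastgroup
        text = match.group()
        if kind == "IDENT" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        if kind != "WS":  # whitespace separates tokens but is not one
            tokens.append(Token(kind, text, pos))
        pos = match.end()
    return tokens
```

Calling `tokenize("SELECT 'abc")` raises a `LexError` whose `pos` points at the opening quote, which is the kind of location information a later error report can reuse.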
Parsing (AST Construction)
The token stream is parsed into an Abstract Syntax Tree (AST) according to the SQL grammar. Syntax errors such as missing clauses, misplaced keywords, or malformed expressions are detected here. Arroyo builds on Apache DataFusion, whose SQL front end uses the sqlparser crate and supports a broad SQL dialect.
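To make the parsing step concrete, here is a small recursive-descent sketch for a toy grammar (`SELECT col, ... FROM table [WHERE col op value]`). It is not the DataFusion parser; the `Select` node, the `ParseError` type, and the grammar itself are assumptions chosen to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import Optional

KEYWORDS = {"SELECT", "FROM", "WHERE"}
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
COMPARISONS = {"=", "<", ">", "<=", ">=", "<>"}


class ParseError(Exception):
    """Raised when the token stream does not match the toy grammar."""


@dataclass
class Select:
    columns: list[str]
    table: str
    where: Optional[tuple[str, str, str]] = None  # (column, operator, value)


def parse_select(sql: str) -> Select:
    # Crude token split: characters it does not recognize are skipped,
    # because lexical errors are out of scope for this sketch.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|<=|>=|<>|[=<>,]", sql)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect_keyword(word: str) -> None:
        tok = peek()
        if tok is None or tok.upper() != word:
            raise ParseError(f"expected {word}, found {tok or 'end of input'}")
        advance()

    def expect_identifier(what: str) -> str:
        tok = peek()
        if tok is None or not IDENT_RE.fullmatch(tok) or tok.upper() in KEYWORDS:
            raise ParseError(f"expected {what}, found {tok or 'end of input'}")
        return advance()

    expect_keyword("SELECT")
    columns = [expect_identifier("column name")]
    while peek() == ",":
        advance()
        columns.append(expect_identifier("column name"))
    expect_keyword("FROM")
    table = expect_identifier("table name")

    where = None
    nxt = peek()
    if nxt is not None and nxt.upper() == "WHERE":
        advance()
        column = expect_identifier("column name")
        if peek() not in COMPARISONS:
            raise ParseError(f"expected comparison operator, found {peek() or 'end of input'}")
        op = advance()
        if peek() is None:
            raise ParseError("expected value, found end of input")
        where = (column, op, advance())

    if peek() is not None:
        raise ParseError(f"unexpected token {peek()}")
    return Select(columns, table, where)
```

A missing clause, as in `parse_select("SELECT user_id orders")`, fails with `expected FROM, found orders`, while a well-formed query returns a `Select` node that later phases can analyze.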
Semantic Analysis
The AST is analyzed in the context of the system's catalog of registered sources, sinks, and UDFs:
- Schema resolution: Table references in `FROM` clauses are resolved against registered connections (Kafka topics, file sources, etc.). Column references are validated against the schemas of those tables.
- Type checking: Expressions are type-checked to ensure operand compatibility -- for example, verifying that arithmetic operators are applied to numeric types and that comparison operands are compatible.
- UDF validation: If the query references user-defined functions, those functions are checked for existence, correct arity, and compatible argument types. UDF definitions provided alongside the query are compiled and registered before validation proceeds.
- Streaming-specific constraints: Validation enforces rules unique to streaming SQL, such as requiring watermark definitions for time-based operations, ensuring window functions reference valid time columns, and verifying that joins include appropriate windowing constraints.
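The schema-resolution and UDF checks above can be sketched as lookups against an in-memory catalog. The `Catalog` class, the `check_query` function, and the type names below are hypothetical and exist only for this example; Arroyo's real catalog and planner are considerably richer.

```python
from dataclasses import dataclass


@dataclass
class Catalog:
    # Hypothetical catalog shape: table -> {column -> type name},
    # and UDF name -> list of parameter type names.
    tables: dict[str, dict[str, str]]
    udfs: dict[str, list[str]]


def check_query(
    catalog: Catalog,
    table: str,
    columns: list[str],
    udf_calls: list[tuple[str, list[str]]],
) -> list[str]:
    """Return a list of error messages; an empty list means the checks passed.

    Each UDF call is (function name, [argument column names]).
    """
    errors = []

    # Schema resolution: the table must be registered, and every
    # referenced column must exist in its schema.
    schema = catalog.tables.get(table)
    if schema is None:
        errors.append(f"unknown table '{table}'")
    else:
        for col in columns:
            if col not in schema:
                errors.append(f"unknown column '{col}' in table '{table}'")

    # UDF validation: existence, arity, then argument types.
    for name, args in udf_calls:
        params = catalog.udfs.get(name)
        if params is None:
            errors.append(f"unknown function '{name}'")
        elif len(args) != len(params):
            errors.append(
                f"function '{name}' expects {len(params)} argument(s), got {len(args)}"
            )
        elif schema is not None:
            for i, (arg, expected) in enumerate(zip(args, params), start=1):
                actual = schema.get(arg)
                if actual is None:
                    errors.append(f"unknown column '{arg}' in call to '{name}'")
                elif actual != expected:
                    errors.append(
                        f"argument {i} of '{name}' has type {actual}, expected {expected}"
                    )
    return errors
```

Collecting every error instead of stopping at the first one lets a user fix several problems per validation round.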
Query Plan Generation (Preview)
On successful validation, a logical query plan is generated and translated into a pipeline graph preview. This graph shows the structure of the dataflow -- sources, operators, and sinks -- without actually compiling the full executable program. The preview allows users to understand the topology of their query before committing to pipeline creation.
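A pipeline graph preview can be thought of as a list of nodes plus the edges that connect them. The sketch below assumes a simple linear pipeline and a hypothetical `{"nodes": [...], "edges": [...]}` layout; the field names are illustrative and are not taken from Arroyo's API.

```python
def build_preview(source: str, operators: list[str], sink: str) -> dict:
    """Build a linear source -> operators -> sink graph description.

    Nothing is compiled or executed; the result only describes topology.
    """
    names = [f"source: {source}", *operators, f"sink: {sink}"]
    nodes = [{"id": i, "description": name} for i, name in enumerate(names)]
    # In a linear pipeline each node feeds exactly the next one.
    edges = [{"src": i, "dest": i + 1} for i in range(len(nodes) - 1)]
    return {"nodes": nodes, "edges": edges}


def render(preview: dict) -> str:
    """Render the preview as a one-line textual diagram."""
    return " -> ".join(node["description"] for node in preview["nodes"])
```

For example, `render(build_preview("orders", ["filter: amount > 100"], "results"))` yields a single line listing the source, the filter, and the sink in dataflow order. Real query plans can branch and merge (for joins, for instance), which this linear sketch does not model.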
Error Reporting
When validation fails at any stage, the system collects and returns structured error messages. These messages identify the nature and location of errors, enabling users to correct their queries iteratively.
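One possible shape for such a structured result is sketched below. The `ValidationError` and `ValidationResult` classes and their fields are assumptions for illustration, not Arroyo's response schema; the helper shows how a character offset from an earlier phase could be turned into a line and column for the user.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ValidationError:
    stage: str                    # e.g. "lexical", "syntax", "semantic"
    message: str
    line: Optional[int] = None    # 1-based, if the location is known
    column: Optional[int] = None  # 1-based, if the location is known


@dataclass
class ValidationResult:
    # Exactly one of the two is expected to be populated:
    # a graph preview on success, or a non-empty error list on failure.
    graph: Optional[dict] = None
    errors: list[ValidationError] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def offset_to_line_col(sql: str, offset: int) -> tuple[int, int]:
    """Convert a 0-based character offset into 1-based (line, column)."""
    line = sql.count("\n", 0, offset) + 1
    line_start = sql.rfind("\n", 0, offset) + 1  # 0 when on the first line
    return line, offset - line_start + 1


def to_json(result: ValidationResult) -> str:
    """Serialize the result, for example as an HTTP response body."""
    return json.dumps(asdict(result), indent=2)
```

Keeping the stage and the location as separate fields, rather than folding them into the message string, makes it easier for a console or CI job to highlight the offending span.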
Usage
SQL query validation is applied in the following scenarios:
- Interactive query development: Users submit SQL queries through the Arroyo web console or REST API and receive immediate validation feedback before creating a pipeline.
- CI/CD pipeline checks: Automated systems validate SQL queries as part of a deployment pipeline, catching errors before they reach production.
- UDF integration testing: When developing new UDFs, validation confirms that the function signatures and types are compatible with their usage in SQL queries.
- Schema migration verification: After modifying source or sink schemas, validation can confirm that existing queries remain compatible.
Theoretical Basis
Compiler Front-End Theory
SQL query validation implements the classical compiler front-end pipeline: lexical analysis (tokenization), syntactic analysis (parsing to AST), and semantic analysis (type checking and name resolution). Each phase produces progressively richer representations of the input and catches different classes of errors.
Streaming SQL Extensions
Standard SQL validation is extended for streaming contexts with additional constraints:
- Watermark references: Streaming queries that use event-time windowing must reference columns with associated watermark definitions. The validator ensures that time-based operations (tumbling windows, sliding windows, session windows) reference valid watermarked columns.
- Window function correctness: Window specifications must use supported window types and valid time intervals. The validator rejects unbounded aggregations in streaming contexts where they would require unbounded state.
- Streaming join constraints: Joins in streaming SQL require windowed or temporal conditions to bound the state required for join processing. The validator enforces that joins include appropriate time-based predicates.
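The three constraints above can be sketched as checks over a summary of the query. The `QueryShape` summary, the window type names, and the error wording are hypothetical; the sketch encodes the rules as this page states them rather than the exact behavior of Arroyo's planner.

```python
from dataclasses import dataclass
from typing import Optional

# Window kinds named in the text; the identifiers are illustrative.
WINDOW_TYPES = {"tumbling", "sliding", "session"}


@dataclass
class QueryShape:
    """Hypothetical summary of the query features relevant to streaming."""
    has_aggregate: bool
    has_join: bool
    window_type: Optional[str] = None
    window_seconds: Optional[float] = None
    time_column: Optional[str] = None


def check_streaming(shape: QueryShape, watermarked_columns: set[str]) -> list[str]:
    errors = []
    windowed = shape.window_type is not None

    if windowed:
        # Window function correctness: supported type, valid interval.
        if shape.window_type not in WINDOW_TYPES:
            errors.append(f"unsupported window type '{shape.window_type}'")
        if shape.window_seconds is None or shape.window_seconds <= 0:
            errors.append("window interval must be positive")
        # Watermark references: the window's time column needs a watermark.
        if shape.time_column not in watermarked_columns:
            errors.append(f"time column '{shape.time_column}' has no watermark")

    # Unbounded aggregations and joins would need unbounded state.
    if shape.has_aggregate and not windowed:
        errors.append("aggregation without a window would require unbounded state")
    if shape.has_join and not windowed:
        errors.append("join needs a window or time-based predicate to bound its state")
    return errors
```

A windowed aggregate over a watermarked column passes with no errors, while the same aggregate without a window is rejected because its state could grow without bound.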
Type System
The type checker implements a structural type system over SQL's type hierarchy (integers, floats, strings, timestamps, intervals, arrays, structs). Type inference propagates types through expressions, and implicit coercions are applied where the SQL standard permits them (e.g., integer to float promotion).
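Type inference with implicit numeric promotion can be sketched as a recursive walk over an expression tree. The tuple-based expression encoding, the type names, and the promotion order below are simplifying assumptions for this example and do not reproduce the coercion rules of DataFusion or the SQL standard in full.

```python
# Wider numeric types have a higher rank; promotion picks the wider operand.
NUMERIC_RANK = {"INT": 0, "BIGINT": 1, "FLOAT": 2, "DOUBLE": 3}
ARITHMETIC = {"+", "-", "*", "/"}
COMPARISON = {"=", "<>", "<", "<=", ">", ">="}


class TypeCheckError(Exception):
    """Raised when an expression cannot be given a type."""


def literal_type(value) -> str:
    # bool must be tested before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "TEXT"
    raise TypeCheckError(f"unsupported literal {value!r}")


def infer(expr: tuple, schema: dict[str, str]) -> str:
    """Infer the type of an expression.

    Expressions are tuples: ("col", name), ("lit", value),
    or ("binop", operator, left, right).
    """
    kind = expr[0]
    if kind == "col":
        name = expr[1]
        if name not in schema:
            raise TypeCheckError(f"unknown column '{name}'")
        return schema[name]
    if kind == "lit":
        return literal_type(expr[1])
    if kind == "binop":
        _, op, left, right = expr
        lt, rt = infer(left, schema), infer(right, schema)
        both_numeric = lt in NUMERIC_RANK and rt in NUMERIC_RANK
        if op in ARITHMETIC:
            if not both_numeric:
                raise TypeCheckError(f"operator '{op}' needs numeric operands, got {lt} and {rt}")
            # Implicit coercion: the result takes the wider operand type.
            return max(lt, rt, key=NUMERIC_RANK.__getitem__)
        if op in COMPARISON:
            if both_numeric or lt == rt:
                return "BOOLEAN"
            raise TypeCheckError(f"cannot compare {lt} with {rt}")
        raise TypeCheckError(f"unsupported operator '{op}'")
    raise TypeCheckError(f"unknown expression node '{kind}'")
```

With a schema of `{"qty": "INT"}`, the expression `qty + 1.5` is inferred as `DOUBLE` because the integer operand is promoted, whereas adding a `TEXT` column to a number raises a `TypeCheckError`.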