Principle:ArroyoSystems Arroyo UDF Validation
Summary
This principle covers validating UDF source code before compilation in the Arroyo streaming engine. Validation is the critical pre-compilation step that parses user-provided function definitions, extracts metadata (signatures, types, dependencies), and performs lightweight compilation checks to provide fast feedback on errors.
Core Concept
UDF validation is a multi-stage pipeline that transforms raw user source code into a structured representation suitable for compilation. The pipeline operates on the source text without producing final compiled artifacts, enabling rapid iteration during UDF development.
The validation stages are:
- Lexical and syntactic analysis -- Parse the source code to extract function signatures
- Type mapping -- Map user-facing parameter and return types to Arrow DataTypes
- Async detection -- Determine whether the function is synchronous or asynchronous
- Dependency extraction -- Parse TOML dependency declarations from comment blocks
- Compilation check -- Run `cargo check` to verify the code compiles without producing artifacts
Theoretical Basis
Static Validation for Early Error Detection
Static validation of user-provided code before full compilation serves a critical role in the UDF development workflow. By catching errors early -- at the parsing and type-checking stages -- the system provides fast feedback without incurring the cost of a full compilation cycle.
Parsing Strategy
The system uses language-specific parsing tools:
| Language | Parser | Purpose |
|---|---|---|
| Rust | `syn` crate | Full syntactic analysis of function items, extracting parameter names, types, return type, and async markers |
| Python | AST parsing | Extract function signatures and type annotations from decorated functions |
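As an illustration of the Python row, here is a minimal sketch that uses the standard library `ast` module to pull a name, parameter annotations, return annotation, and async marker out of a function without executing it. The `UdfSignature` container, the `extract_signature` helper, and the `@udf` decorator name in the sample are assumptions made for this example, not Arroyo's actual types or API. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class UdfSignature:
    """Hypothetical container for the metadata pulled out of one function."""
    name: str
    params: list  # list of (parameter name, annotation source text) tuples
    return_type: Optional[str]
    is_async: bool


def extract_signature(source: str) -> UdfSignature:
    """Parse the source and describe its first top-level function, without running it."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = [
                (arg.arg, ast.unparse(arg.annotation) if arg.annotation else "")
                for arg in node.args.args
            ]
            returns = ast.unparse(node.returns) if node.returns else None
            return UdfSignature(
                name=node.name,
                params=params,
                return_type=returns,
                # `async def` parses to a distinct node type, so detection is a type test.
                is_async=isinstance(node, ast.AsyncFunctionDef),
            )
    raise ValueError("no function definition found in UDF source")


# The decorator name is only illustrative; ast.parse never resolves it.
example = '''
@udf
def add_tax(price: float, rate: float) -> float:
    return price * (1.0 + rate)
'''
sig = extract_signature(example)
```

Because the source is only parsed, never imported or executed, a malformed or malicious function body cannot run during this stage.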
Type Mapping
User-facing types must be mapped to Arrow DataTypes for integration with the columnar execution engine:
| Rust Type | Python Type | Arrow DataType |
|---|---|---|
| i64 | int | Int64 |
| f64 | float | Float64 |
| String | str | Utf8 |
| bool | bool | Boolean |
| Option<T> | Optional[T] | Nullable variant of mapped type |
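The mapping can be sketched as a lookup plus one special case for the optional wrapper. The example below covers only the Python column, represents Arrow types as plain strings, and returns nullability as a separate flag. The `PYTHON_TO_ARROW` table and `map_annotation` function are hypothetical names for illustration, and rejecting every unlisted type is an assumption of this sketch rather than documented behavior.

```python
import ast

# Scalar mappings copied from the table; Arrow types are plain strings here.
PYTHON_TO_ARROW = {"int": "Int64", "float": "Float64", "str": "Utf8", "bool": "Boolean"}


def map_annotation(annotation: str) -> tuple:
    """Return (arrow_type, nullable) for a Python annotation given as source text."""
    node = ast.parse(annotation, mode="eval").body
    # Optional[T] parses as a Subscript whose value is the name "Optional".
    if (
        isinstance(node, ast.Subscript)
        and isinstance(node.value, ast.Name)
        and node.value.id == "Optional"
    ):
        inner, _ = map_annotation(ast.unparse(node.slice))
        return inner, True
    if isinstance(node, ast.Name) and node.id in PYTHON_TO_ARROW:
        return PYTHON_TO_ARROW[node.id], False
    # Anything outside the table is rejected in this sketch.
    raise TypeError(f"unsupported UDF type: {annotation}")
```

For instance, `map_annotation("Optional[str]")` yields the `Utf8` type with the nullable flag set, while `map_annotation("list")` raises `TypeError`.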
Dependency Resolution
Rust UDFs can declare external crate dependencies via TOML blocks embedded in comments at the top of the source file. The validation pipeline extracts these using TOML parsing and includes them in the generated Cargo.toml for the UDF crate.
Lightweight Compilation Check
The final validation step runs `cargo check`, which performs full type checking and borrow checking but skips code generation, so no compiled artifacts are produced. This is significantly faster than a full `cargo build` and catches semantic errors that syntactic parsing alone cannot detect.
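The check can be sketched as two steps: lay out a throwaway library crate, then invoke `cargo check` in it and capture the diagnostics. The crate name `udf_check`, the helper names, and the simplification that every dependency is a plain version string are assumptions made for this example. The layout Arroyo actually generates is not described here.

```python
import subprocess
from pathlib import Path


def write_check_crate(crate_dir: Path, udf_source: str, dependencies: dict) -> None:
    """Lay out a minimal library crate: Cargo.toml plus src/lib.rs holding the UDF."""
    # Simplification: each dependency is a name mapped to a version string.
    dep_lines = "\n".join(f'{name} = "{version}"' for name, version in dependencies.items())
    manifest = (
        "[package]\n"
        'name = "udf_check"\n'  # hypothetical crate name
        'version = "0.1.0"\n'
        'edition = "2021"\n'
        "\n[dependencies]\n"
        f"{dep_lines}\n"
    )
    (crate_dir / "src").mkdir(parents=True, exist_ok=True)
    (crate_dir / "Cargo.toml").write_text(manifest)
    (crate_dir / "src" / "lib.rs").write_text(udf_source)


def cargo_check(crate_dir: Path) -> tuple:
    """Run `cargo check` in crate_dir; return (succeeded, compiler diagnostics)."""
    result = subprocess.run(
        ["cargo", "check", "--quiet"],
        cwd=crate_dir,
        capture_output=True,
        text=True,
    )
    # Cargo writes compiler diagnostics to stderr.
    return result.returncode == 0, result.stderr
```

A caller would typically create the crate inside a `tempfile.TemporaryDirectory()`, call `write_check_crate`, then `cargo_check`, and surface the returned diagnostics to the user when the first element is `False`.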
Validation Pipeline Flow
The validation pipeline proceeds as follows:
- Input: Raw UDF source code as a string
- Stage 1: Parse source to extract function item(s)
- Stage 2: Extract function name, parameters, return type, and async marker
- Stage 3: Map all parameter and return types to Arrow DataTypes
- Stage 4: Extract TOML dependency block from leading comments (if present)
- Stage 5: Construct a temporary crate and run `cargo check`
- Output: A `ParsedUdfFile` containing the validated function metadata, or a descriptive error
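The flow above amounts to running stages in order and stopping at the first failure, with each error naming the stage that raised it. The sketch below shows that shape for a Python UDF using only two stages. `UdfValidationError`, the stage functions, and the dictionary passed between them are hypothetical stand-ins, and the plain dictionary returned here is much simpler than the `ParsedUdfFile` the text mentions.

```python
import ast
from typing import Optional


class UdfValidationError(Exception):
    """Hypothetical error carrying the failing stage and, when known, a line number."""

    def __init__(self, stage: str, message: str, line: Optional[int] = None):
        location = f" (line {line})" if line is not None else ""
        super().__init__(f"{stage}: {message}{location}")
        self.stage = stage
        self.line = line


def parse_stage(ctx: dict) -> dict:
    """Stage 1: turn the source text into a syntax tree."""
    try:
        ctx["tree"] = ast.parse(ctx["source"])
    except SyntaxError as err:
        raise UdfValidationError("parse", err.msg, err.lineno) from err
    return ctx


def signature_stage(ctx: dict) -> dict:
    """Stage 2: require a function whose parameters are all annotated."""
    funcs = [n for n in ctx["tree"].body
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        raise UdfValidationError("signature", "no function definition found")
    func = funcs[0]
    for arg in func.args.args:
        if arg.annotation is None:
            raise UdfValidationError(
                "signature", f"parameter '{arg.arg}' has no type annotation", arg.lineno
            )
    ctx["name"] = func.name
    return ctx


def validate(source: str, stages: list) -> dict:
    """Run stages in order; the first failure aborts the remaining stages."""
    ctx = {"source": source}
    for stage in stages:
        ctx = stage(ctx)
    return ctx
```

Ordering the cheap stages first means a syntax error is reported without ever constructing a crate or invoking the compiler.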
Design Considerations
- Fast feedback: Validation should complete in seconds, not minutes, to support interactive UDF development in the web UI
- Descriptive errors: Error messages from each stage should clearly indicate what failed and where, including line numbers when available
- Language parity: Both Rust and Python UDFs should go through equivalent validation stages, though the implementation details differ by language