Principle:ArroyoSystems Arroyo UDF Authoring
Summary
This principle covers writing User-Defined Functions (UDFs) for extending SQL query capabilities in the Arroyo streaming engine. UDFs allow users to define custom scalar logic that can be invoked directly from SQL queries, augmenting the built-in function set with arbitrary computation. Arroyo supports both Rust and Python UDFs, with sync and async variants.
Core Concept
The fundamental pattern for UDF authoring relies on macros and decorators as metaprogramming tools. Users annotate a standard function with a language-specific marker:
- Rust: The #[udf] proc macro attribute
- Python: The @udf decorator
These annotations trigger automatic code generation that produces the necessary FFI (Foreign Function Interface) glue code for seamless integration with the streaming engine's runtime.
Theoretical Basis
UDFs extend a query engine's built-in function set by allowing user-provided logic to participate in the query evaluation pipeline. The macro/decorator pattern leverages metaprogramming to accomplish several critical tasks:
FFI Entry Point Generation
The macro/decorator generates FFI entry points that handle conversion between Arrow columnar format and native language types. This conversion is essential because the streaming engine operates on Arrow arrays internally, while user functions operate on scalar native types.
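The column-to-scalar bridging described above can be sketched in a few lines. This is a minimal illustration, not Arroyo's generated code: plain Python lists stand in for Arrow arrays, and `vectorize`, `label`, and `label_columns` are hypothetical names introduced here.

```python
from typing import Callable, List


def vectorize(scalar_fn: Callable) -> Callable:
    """Wrap a scalar function so it can be applied to whole columns.

    Assumption: lists stand in for Arrow arrays; a real engine would
    convert Arrow buffers to native values and back.
    """
    def column_fn(*columns: List) -> List:
        # All input columns must describe the same number of rows.
        lengths = {len(col) for col in columns}
        if len(lengths) > 1:
            raise ValueError("input columns differ in length")
        # Call the scalar function once per row and collect an output column.
        return [scalar_fn(*row) for row in zip(*columns)]
    return column_fn


def label(x: int, y: str) -> str:
    # The user-written scalar logic only ever sees one row at a time.
    return f"{y}: {x}"


label_columns = vectorize(label)
result = label_columns([1, 2, 3], ["a", "b", "c"])
```

The user function stays scalar; only the wrapper knows about columns, which is the division of labor the generated glue code provides.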
Async Execution Support
For I/O-bound UDFs (e.g., functions that call external APIs or databases), the system supports async variants. Async UDFs generate additional FFI entry points for lifecycle management:
- __start: initialize the async runtime
- __send: submit a row for processing
- __drain_results: collect completed results
- __stop_runtime: shut down the async runtime
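The four-step lifecycle can be mimicked with a small asyncio sketch. It only illustrates the start/send/drain/stop shape; the class `AsyncUdfRuntime`, its method names, and `slow_double` are hypothetical, and error handling for failed rows is deliberately omitted.

```python
import asyncio
import queue
import threading
import time


class AsyncUdfRuntime:
    """Toy stand-in for the four lifecycle entry points of an async UDF."""

    def __init__(self, fn):
        self._fn = fn
        self._loop = None
        self._thread = None
        self._results = queue.Queue()

    def start(self):
        # Counterpart of __start: run an event loop on a background thread.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def send(self, row_id, *args):
        # Counterpart of __send: schedule one row without blocking the caller.
        future = asyncio.run_coroutine_threadsafe(self._fn(*args), self._loop)
        future.add_done_callback(lambda f: self._results.put((row_id, f.result())))

    def drain_results(self):
        # Counterpart of __drain_results: return whatever has completed so far.
        done = []
        while True:
            try:
                done.append(self._results.get_nowait())
            except queue.Empty:
                return done

    def stop_runtime(self):
        # Counterpart of __stop_runtime: stop the loop and join its thread.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def slow_double(x: int) -> int:
    await asyncio.sleep(0.01)  # stands in for an external API or database call
    return x * 2


runtime = AsyncUdfRuntime(slow_double)
runtime.start()
for row_id, value in enumerate([1, 2, 3]):
    runtime.send(row_id, value)

collected = []
deadline = time.monotonic() + 5.0  # guard so the poll loop cannot spin forever
while len(collected) < 3 and time.monotonic() < deadline:
    collected.extend(runtime.drain_results())
    time.sleep(0.005)
runtime.stop_runtime()
results = sorted(collected)
```

Tagging each row with an identifier lets results complete out of order and still be matched back to their inputs when drained.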
Type Safety Across FFI
The macro system enforces type safety across the FFI boundary by:
- Parsing the function signature at compile time
- Mapping Rust/Python types to corresponding Arrow DataTypes
- Generating type-checked conversion code in both directions
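A rough sketch of the signature-parsing step, written with Python's standard introspection tools. The `ARROW_TYPES` table and the helper names `arrow_type` and `udf_signature` are assumptions made for illustration and are not taken from Arroyo's source; the real mapping may cover more types.

```python
import inspect
from typing import Optional, Union, get_args, get_origin, get_type_hints

# Hypothetical mapping from Python annotations to Arrow type names.
ARROW_TYPES = {int: "Int64", float: "Float64", str: "Utf8", bool: "Boolean"}


def arrow_type(annotation):
    """Return (arrow_type_name, nullable) for a single annotation."""
    nullable = False
    if get_origin(annotation) is Union:
        # Optional[T] is Union[T, None]; anything wider is rejected.
        members = [a for a in get_args(annotation) if a is not type(None)]
        if len(members) != 1:
            raise TypeError(f"unsupported union: {annotation}")
        annotation, nullable = members[0], True
    if annotation not in ARROW_TYPES:
        raise TypeError(f"unsupported UDF type: {annotation}")
    return ARROW_TYPES[annotation], nullable


def udf_signature(fn):
    """Map every parameter and the return annotation to an Arrow type."""
    hints = get_type_hints(fn)
    params = list(inspect.signature(fn).parameters)
    missing = [name for name in params if name not in hints]
    if missing or "return" not in hints:
        raise TypeError(f"{fn.__name__} must annotate all parameters and its return type")
    return [arrow_type(hints[name]) for name in params], arrow_type(hints["return"])


def my_udf(x: int, y: str) -> Optional[str]:
    return f"{y}: {x}"


signature = udf_signature(my_udf)
```

Rejecting unannotated or unsupported signatures at registration time means a type mismatch is reported before any data flows through the function.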
Dynamic Library Loading
Rust UDFs compile to dynamic libraries (.so on Linux, .dylib on macOS) that are loaded at runtime via the dlopen2 crate. This allows new functions to be added without recompiling the engine.
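The general load-then-look-up pattern behind dynamic loading can be shown with Python's ctypes module. This is an analogy only: it loads the system C math library rather than a compiled UDF, assumes a POSIX system (Linux or macOS), and does not use dlopen2.

```python
import ctypes
import ctypes.util

# Locate the C math library; the file name differs per platform.
# If it cannot be located, None makes ctypes search the running process.
path = ctypes.util.find_library("m")
libm = ctypes.CDLL(path)

# Declare the symbol's signature before calling it. The loader cannot
# check this, so a wrong declaration would silently corrupt the call.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

root = libm.sqrt(9.0)
```

The explicit `argtypes` and `restype` declarations play the role that the generated, type-checked glue code plays for a UDF library: the caller and the library must agree on the signature out of band.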
Supported Languages
| Language | Annotation | Sync | Async | Output Format |
|---|---|---|---|---|
| Rust | #[udf] proc macro | Yes | Yes | Dynamic library (.so/.dylib) |
| Python | @udf decorator | Yes | Yes | Python module (interpreted) |
Usage Pattern
Rust UDF
use arroyo_udf_plugin::udf;

#[udf]
fn my_udf(x: i64, y: String) -> Option<String> {
    // Returning Option<String> marks the result column as nullable.
    Some(format!("{}: {}", y, x))
}
Python UDF
from arroyo_udf import udf

@udf
def my_udf(x: int, y: str) -> str:
    return f"{y}: {x}"
Design Considerations
- Nullability: Returning Option<T> in Rust (or Optional[T] in Python) signals that the UDF may produce null values, which maps to nullable Arrow arrays.
- Dependencies: Rust UDFs can declare external crate dependencies via TOML comment blocks, which are extracted and included in the generated Cargo.toml.
- Error Handling: UDF errors are captured and surfaced through the validation and compilation pipeline rather than crashing the streaming engine.
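To make the dependency-extraction idea concrete, the sketch below pulls crate names and versions out of a comment block at the top of a UDF source string. The exact comment layout shown in `UDF_SOURCE` is an assumption, `extract_dependencies` is a hypothetical helper, and the line-based parsing handles only simple `name = "version"` entries rather than full TOML.

```python
import re

# Assumed layout: a block comment holding a [dependencies] table,
# followed by the UDF itself.
UDF_SOURCE = '''/*
[dependencies]
regex = "1.10.2"
*/

#[udf]
fn my_udf(x: i64, y: String) -> Option<String> {
    Some(format!("{}: {}", y, x))
}
'''


def extract_dependencies(source: str) -> dict:
    """Collect name -> version pairs from a leading /* ... */ block."""
    block = re.match(r"\s*/\*(.*?)\*/", source, re.DOTALL)
    if block is None:
        return {}
    deps = {}
    in_deps = False
    for line in block.group(1).splitlines():
        line = line.strip()
        if line.startswith("["):
            # Track whether the current table is [dependencies].
            in_deps = line == "[dependencies]"
        elif in_deps and "=" in line:
            name, version = line.split("=", 1)
            deps[name.strip()] = version.strip().strip('"')
    return deps


dependencies = extract_dependencies(UDF_SOURCE)
```

A build step could then write these pairs into the generated Cargo.toml before compiling the UDF into a dynamic library.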