Workflow: ArroyoSystems Arroyo UDF Development
| Knowledge Sources | Details |
|---|---|
| Domains | Stream_Processing, SQL, UDFs |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
End-to-end process for writing, compiling, registering, and using user-defined functions (UDFs) in Arroyo SQL pipelines, supporting UDFs written in Rust or Python with synchronous and asynchronous execution modes.
Description
This workflow covers the creation of custom functions that extend Arroyo's SQL capabilities beyond the built-in function library. Arroyo supports UDFs written in Rust (compiled to shared libraries at runtime) and Python (executed via PyO3/PyArrow). Rust UDFs offer maximum performance and can be synchronous scalar functions, asynchronous functions (for I/O operations like HTTP lookups), or aggregate functions (UDAFs). Python UDFs provide easier development with access to the Python ecosystem. UDFs are registered globally or locally to a pipeline and become callable within SQL queries.
Key aspects:
- Rust UDFs use the #[udf] proc macro and are dynamically compiled into shared libraries
- Python UDFs use a @udf decorator and run in sub-interpreters via PyO3
- Async UDFs enable non-blocking I/O operations within the streaming pipeline
- UDAFs (User-Defined Aggregate Functions) support custom aggregation logic with vector arguments
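As a concrete illustration, a minimal Python scalar UDF might look like the sketch below. The @udf decorator here is a local no-op stand-in so the example runs on its own (in a real deployment the decorator is provided by Arroyo's Python UDF runtime); the function name and log format are hypothetical.

```python
# Minimal sketch of a Python scalar UDF. The decorator below is a no-op
# stand-in for Arroyo's real @udf decorator; function name and input
# format are illustrative.

def udf(func):
    """Stand-in: the real decorator registers the function with Arroyo."""
    return func

@udf
def parse_status(line: str) -> int:
    """Extract the trailing HTTP status code from a space-delimited log line."""
    return int(line.split()[-1])

print(parse_status("127.0.0.1 GET /index.html 200"))  # prints 200
```

Once registered, a function like this would be callable by name from SQL, e.g. SELECT parse_status(raw_line) FROM logs.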
Usage
Execute this workflow when the built-in SQL functions do not cover your data transformation needs and you require custom processing logic. Common use cases include custom parsing (log formats, domain-specific protocols), external service lookups (geocoding, enrichment APIs), specialized aggregations (custom statistics, approximate algorithms), and format conversions (CBOR to JSON, custom serialization).
Execution Steps
Step 1: Write UDF Source Code
Write the UDF function source code in either Rust or Python. For Rust UDFs, annotate the function with the #[udf] macro and declare any external dependencies in a TOML block comment. The function signature determines the SQL function's argument types and return type. For Python UDFs, use the @udf decorator and specify return type annotations using PyArrow types. The function operates on individual values (scalar UDF) or batches (UDAF).
Key considerations:
- Rust UDFs support Option types for nullable arguments and return values
- Async Rust UDFs use the async keyword and can perform I/O (HTTP requests, database lookups)
- Python UDFs receive and return PyArrow arrays for batch processing efficiency
- External dependencies for Rust UDFs are declared in a block comment using TOML format
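The nullable-argument behavior that Rust expresses with Option types can be mirrored in Python with Optional annotations. The sketch below again uses a local stand-in decorator so it runs standalone; the mapping of None to SQL NULL is the point being illustrated.

```python
from typing import Optional

def udf(func):
    """Stand-in for Arroyo's @udf decorator."""
    return func

@udf
def safe_parse_int(s: Optional[str]) -> Optional[int]:
    """Nullable in, nullable out: None corresponds to SQL NULL, and an
    unparseable value also maps to NULL rather than failing the batch."""
    if s is None:
        return None
    try:
        return int(s.strip())
    except ValueError:
        return None

print(safe_parse_int(" 42 "), safe_parse_int("n/a"), safe_parse_int(None))
# prints: 42 None None
```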
Step 2: Validate UDF
Submit the UDF source code to the validation endpoint. For Rust UDFs, the system parses the source file to extract the function signature, validates the #[udf] annotation, checks argument and return types against supported Arrow types, and performs a full compilation via the compiler service to verify the code builds successfully. For Python UDFs, the system validates the decorator, checks type annotations, and verifies the function can be loaded.
Key considerations:
- Validation performs full compilation for Rust UDFs to catch compile errors early
- The parsed function signature determines the SQL function name and argument types
- Compilation results are cached by content hash to speed up repeated validations
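The signature-extraction part of validation can be sketched in Python: given UDF source text, recover the function name, argument annotations, and return annotation that together define the SQL function. This is a simplification for illustration, not Arroyo's actual validator (which also type-checks against supported Arrow types and, for Rust, fully compiles the code).

```python
import ast

def extract_signature(source: str):
    """Sketch: parse Python UDF source and return the function name plus
    (argument, annotation) pairs and the return annotation, which together
    determine the SQL function's name and types."""
    tree = ast.parse(source)
    fn = next(n for n in tree.body if isinstance(n, ast.FunctionDef))
    args = [(a.arg, ast.unparse(a.annotation)) for a in fn.args.args]
    ret = ast.unparse(fn.returns) if fn.returns is not None else None
    return fn.name, args, ret

src = "def double(x: int) -> int:\n    return x * 2\n"
print(extract_signature(src))  # ('double', [('x', 'int')], 'int')
```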
Step 3: Compile and Register UDF
Persist the UDF by creating a global UDF record in the database. For Rust UDFs, the compiler service generates a temporary Cargo crate containing the UDF source and dependencies, compiles it to a shared library via cargo build, validates the library can be loaded via dlopen, and uploads it to the configured storage backend. The database record includes the UDF source code, language, and the URL of the compiled dylib. For Python UDFs, only the source code is persisted (no compilation step required).
Key considerations:
- The compiler service runs as a separate gRPC service (or embedded in cluster mode)
- Compiled dylibs are stored in object storage (S3, GCS, or local filesystem)
- The compilation cache avoids rebuilding identical UDF source code
- Multiple UDFs can be defined in a single source file
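The content-hash compilation cache mentioned above can be sketched as follows. The build callback stands in for the cargo-build-and-upload step, and all names (including the storage URL scheme) are illustrative.

```python
import hashlib

class CompilationCache:
    """Sketch: memoize compiled-artifact locations keyed by a hash of the
    UDF source, so byte-identical submissions skip the build entirely."""

    def __init__(self, build):
        self._build = build      # stand-in for cargo build + upload, returns a URL
        self._entries = {}

    @staticmethod
    def key(source: str) -> str:
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def get_or_build(self, source: str) -> str:
        k = self.key(source)
        if k not in self._entries:
            self._entries[k] = self._build(source)
        return self._entries[k]

builds = []
cache = CompilationCache(
    lambda src: builds.append(src) or f"s3://udfs/{CompilationCache.key(src)}.so"
)
url1 = cache.get_or_build("fn f() {}")
url2 = cache.get_or_build("fn f() {}")  # cache hit: no second build
print(url1 == url2, len(builds))  # True 1
```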
Step 4: Use UDF in SQL Pipeline
Reference the registered UDF by name in SQL queries. During pipeline compilation, the schema provider loads all global UDFs and any pipeline-local UDFs, registers them with DataFusion's function registry, and makes them available for query planning. The planner resolves UDF calls, validates argument types, and includes them in the logical plan. At execution time, workers download the compiled dylib, load it via dlopen, and invoke the UDF for each record batch.
Key considerations:
- UDFs are invoked per-batch using Arrow columnar format for efficiency
- Async UDFs are executed in a dedicated operator that manages concurrency and result ordering
- UDAFs can be used in GROUP BY aggregations and window functions
- Pipeline-local UDFs (defined in the CREATE PIPELINE SQL) are compiled on-the-fly during pipeline creation
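The core behavior of the dedicated async-UDF operator, bounded concurrency with in-order emission, can be sketched with asyncio. This illustrates the pattern only; it is not Arroyo's implementation, and fake_lookup stands in for a real I/O call such as an enrichment-API request.

```python
import asyncio

async def ordered_async_map(items, fn, max_concurrency=8):
    """Sketch: run fn over items with at most max_concurrency in flight,
    but emit results in input order, as an ordered async UDF must."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(x):
        async with sem:
            return await fn(x)

    tasks = [asyncio.create_task(run(x)) for x in items]
    # Awaiting in task-creation order preserves input order even when
    # later calls complete first.
    return [await t for t in tasks]

async def fake_lookup(x):
    await asyncio.sleep(0.01 * (5 - x))  # later items finish sooner
    return x * 10

print(asyncio.run(ordered_async_map([1, 2, 3, 4], fake_lookup)))  # [10, 20, 30, 40]
```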
Step 5: Monitor and Iterate
Monitor the UDF's performance within the running pipeline via operator metrics. If the UDF needs modification, update the source code and create a new version. For pipeline-local UDFs, update the SQL definition and recreate the pipeline. For global UDFs, update the registration and restart affected pipelines to pick up the new version.
Key considerations:
- Async UDFs have configurable concurrency limits and timeout settings
- UDF errors are captured as operator errors visible in the WebUI
- Deleting a global UDF requires that no running pipelines reference it
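The deletion constraint in the last bullet can be sketched as a simple guard. The pipeline record shape and the textual scan are simplifications; a real check would consult the resolved query plan rather than raw SQL text.

```python
def can_delete_udf(udf_name, running_pipelines):
    """Sketch: a global UDF is deletable only when no running pipeline's
    query references it. A textual scan is used here for illustration."""
    return not any(udf_name in p["query"] for p in running_pipelines)

pipelines = [{"name": "geo", "query": "SELECT geocode(ip) FROM events"}]
print(can_delete_udf("geocode", pipelines), can_delete_udf("parse_cbor", pipelines))
# prints: False True
```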