Workflow: ArroyoSystems Arroyo UDF Development
| Knowledge Sources | Details |
|---|---|
| Domains | Stream_Processing, SQL, UDFs |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
End-to-end process for writing, compiling, registering, and using user-defined functions (UDFs) in Arroyo SQL pipelines, supporting UDFs written in Rust or Python with synchronous and asynchronous execution modes.
Description
This workflow covers the creation of custom functions that extend Arroyo's SQL capabilities beyond the built-in function library. Arroyo supports UDFs written in Rust (compiled to shared libraries at runtime) and Python (executed via PyO3/PyArrow). Rust UDFs offer maximum performance and can be synchronous scalar functions, asynchronous functions (for I/O operations like HTTP lookups), or aggregate functions (UDAFs). Python UDFs provide easier development with access to the Python ecosystem. UDFs are registered globally or locally to a pipeline and become callable within SQL queries.
Key aspects:
- Rust UDFs use the #[udf] proc macro and are dynamically compiled into shared libraries
- Python UDFs use a @udf decorator and run in sub-interpreters via PyO3
- Async UDFs enable non-blocking I/O operations within the streaming pipeline
- UDAFs (User-Defined Aggregate Functions) support custom aggregation logic with vector arguments
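As a concrete illustration, a minimal Python scalar UDF might look like the sketch below. The @udf decorator here is a local no-op stand-in so the example runs on its own (in a real deployment the decorator is provided by Arroyo's Python UDF runtime); the function name and log format are hypothetical.

```python
# Minimal sketch of a Python scalar UDF. The decorator below is a no-op
# stand-in for Arroyo's real @udf decorator; function name and input
# format are illustrative.

def udf(func):
    """Stand-in: the real decorator registers the function with Arroyo."""
    return func

@udf
def parse_status(line: str) -> int:
    """Extract the trailing HTTP status code from a space-delimited log line."""
    return int(line.split()[-1])

print(parse_status("127.0.0.1 GET /index.html 200"))  # prints 200
```

Once registered, a function like this would be callable by name from SQL, e.g. SELECT parse_status(raw_line) FROM logs.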
Usage
Execute this workflow when the built-in SQL functions do not cover your data transformation needs and you require custom processing logic. Common use cases include custom parsing (log formats, domain-specific protocols), external service lookups (geocoding, enrichment APIs), specialized aggregations (custom statistics, approximate algorithms), and format conversions (CBOR to JSON, custom serialization).
Execution Steps
Step 1: Write UDF Source Code
Write the UDF function source code in either Rust or Python. For Rust UDFs, annotate the function with the #[udf] macro and declare any external dependencies in a TOML block comment. The function signature determines the SQL function's argument types and return type. For Python UDFs, use the @udf decorator and specify return type annotations using PyArrow types. The function operates on individual values (scalar UDF) or batches (UDAF).
Key considerations:
- Rust UDFs support Option types for nullable arguments and return values
- Async Rust UDFs use the async keyword and can perform I/O (HTTP requests, database lookups)
- Python UDFs receive and return PyArrow arrays for batch processing efficiency
- External dependencies for Rust UDFs are declared in a block comment using TOML format
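The nullable-argument behavior that Rust expresses with Option types can be mirrored in Python with Optional annotations. The sketch below again uses a local stand-in decorator so it runs standalone; the mapping of None to SQL NULL is the point being illustrated.

```python
from typing import Optional

def udf(func):
    """Stand-in for Arroyo's @udf decorator."""
    return func

@udf
def safe_parse_int(s: Optional[str]) -> Optional[int]:
    """Nullable in, nullable out: None corresponds to SQL NULL, and an
    unparseable value also maps to NULL rather than failing the batch."""
    if s is None:
        return None
    try:
        return int(s.strip())
    except ValueError:
        return None

print(safe_parse_int(" 42 "), safe_parse_int("n/a"), safe_parse_int(None))
# prints: 42 None None
```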
Step 2: Validate UDF
Submit the UDF source code to the validation endpoint. For Rust UDFs, the system parses the source file to extract the function signature, validates the #[udf] annotation, checks argument and return types against supported Arrow types, and performs a full compilation via the compiler service to verify the code builds successfully. For Python UDFs, the system validates the decorator, checks type annotations, and verifies the function can be loaded.
Key considerations:
- Validation performs full compilation for Rust UDFs to catch compile errors early
- The parsed function signature determines the SQL function name and argument types
- Compilation results are cached by content hash to speed up repeated validations
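The signature-extraction part of validation can be sketched in Python: given UDF source text, recover the function name, argument annotations, and return annotation that together define the SQL function. This is a simplification for illustration, not Arroyo's actual validator (which also type-checks against supported Arrow types and, for Rust, fully compiles the code).

```python
import ast

def extract_signature(source: str):
    """Sketch: parse Python UDF source and return the function name plus
    (argument, annotation) pairs and the return annotation, which together
    determine the SQL function's name and types."""
    tree = ast.parse(source)
    fn = next(n for n in tree.body if isinstance(n, ast.FunctionDef))
    args = [(a.arg, ast.unparse(a.annotation)) for a in fn.args.args]
    ret = ast.unparse(fn.returns) if fn.returns is not None else None
    return fn.name, args, ret

src = "def double(x: int) -> int:\n    return x * 2\n"
print(extract_signature(src))  # ('double', [('x', 'int')], 'int')
```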
Step 3: Compile and Register UDF
Persist the UDF by creating a global UDF record in the database. For Rust UDFs, the compiler service generates a temporary Cargo crate containing the UDF source and dependencies, compiles it to a shared library via cargo build, validates the library can be loaded via dlopen, and uploads it to the configured storage backend. The database record includes the UDF source code, language, and the URL of the compiled dylib. For Python UDFs, only the source code is persisted (no compilation step required).
Key considerations:
- The compiler service runs as a separate gRPC service (or embedded in cluster mode)
- Compiled dylibs are stored in object storage (S3, GCS, or local filesystem)
- The compilation cache avoids rebuilding identical UDF source code
- Multiple UDFs can be defined in a single source file
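The content-hash compilation cache mentioned above can be sketched as follows. The build callback stands in for the cargo-build-and-upload step, and all names (including the storage URL scheme) are illustrative.

```python
import hashlib

class CompilationCache:
    """Sketch: memoize compiled-artifact locations keyed by a hash of the
    UDF source, so byte-identical submissions skip the build entirely."""

    def __init__(self, build):
        self._build = build      # stand-in for cargo build + upload, returns a URL
        self._entries = {}

    @staticmethod
    def key(source: str) -> str:
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def get_or_build(self, source: str) -> str:
        k = self.key(source)
        if k not in self._entries:
            self._entries[k] = self._build(source)
        return self._entries[k]

builds = []
cache = CompilationCache(
    lambda src: builds.append(src) or f"s3://udfs/{CompilationCache.key(src)}.so"
)
url1 = cache.get_or_build("fn f() {}")
url2 = cache.get_or_build("fn f() {}")  # cache hit: no second build
print(url1 == url2, len(builds))  # True 1
```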
Step 4: Use UDF in SQL Pipeline
Reference the registered UDF by name in SQL queries. During pipeline compilation, the schema provider loads all global UDFs and any pipeline-local UDFs, registers them with DataFusion's function registry, and makes them available for query planning. The planner resolves UDF calls, validates argument types, and includes them in the logical plan. At execution time, workers download the compiled dylib, load it via dlopen, and invoke the UDF for each record batch.
Key considerations:
- UDFs are invoked per-batch using Arrow columnar format for efficiency
- Async UDFs are executed in a dedicated operator that manages concurrency and result ordering
- UDAFs can be used in GROUP BY aggregations and window functions
- Pipeline-local UDFs (defined in the CREATE PIPELINE SQL) are compiled on-the-fly during pipeline creation
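The core behavior of the dedicated async-UDF operator, bounded concurrency with in-order emission, can be sketched with asyncio. This illustrates the pattern only; it is not Arroyo's implementation, and fake_lookup stands in for a real I/O call such as an enrichment-API request.

```python
import asyncio

async def ordered_async_map(items, fn, max_concurrency=8):
    """Sketch: run fn over items with at most max_concurrency in flight,
    but emit results in input order, as an ordered async UDF must."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(x):
        async with sem:
            return await fn(x)

    tasks = [asyncio.create_task(run(x)) for x in items]
    # Awaiting in task-creation order preserves input order even when
    # later calls complete first.
    return [await t for t in tasks]

async def fake_lookup(x):
    await asyncio.sleep(0.01 * (5 - x))  # later items finish sooner
    return x * 10

print(asyncio.run(ordered_async_map([1, 2, 3, 4], fake_lookup)))  # [10, 20, 30, 40]
```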
Step 5: Monitor and Iterate
Monitor the UDF's performance within the running pipeline via operator metrics. If the UDF needs modification, update the source code and create a new version. For pipeline-local UDFs, update the SQL definition and recreate the pipeline. For global UDFs, update the registration and restart affected pipelines to pick up the new version.
Key considerations:
- Async UDFs have configurable concurrency limits and timeout settings
- UDF errors are captured as operator errors visible in the WebUI
- Deleting a global UDF requires that no running pipelines reference it
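The deletion constraint in the last bullet can be sketched as a simple guard. The pipeline record shape and the textual scan are simplifications; a real check would consult the resolved query plan rather than raw SQL text.

```python
def can_delete_udf(udf_name, running_pipelines):
    """Sketch: a global UDF is deletable only when no running pipeline's
    query references it. A textual scan is used here for illustration."""
    return not any(udf_name in p["query"] for p in running_pipelines)

pipelines = [{"name": "geo", "query": "SELECT geocode(ip) FROM events"}]
print(can_delete_udf("geocode", pipelines), can_delete_udf("parse_cbor", pipelines))
# prints: False True
```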