Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:ArroyoSystems Arroyo UDF Authoring

From Leeroopedia


Template:Principle

Summary

This principle covers writing User-Defined Functions (UDFs) for extending SQL query capabilities in the Arroyo streaming engine. UDFs allow users to define custom scalar logic that can be invoked directly from SQL queries, augmenting the built-in function set with arbitrary computation. Arroyo supports both Rust and Python UDFs, with sync and async variants.

Core Concept

The fundamental pattern for UDF authoring relies on macros and decorators as metaprogramming tools. Users annotate a standard function with a language-specific marker:

  • Rust: The #[udf] proc macro attribute
  • Python: The @udf decorator

These annotations trigger automatic code generation that produces the necessary FFI (Foreign Function Interface) glue code for seamless integration with the streaming engine's runtime.

Theoretical Basis

UDFs extend a query engine's built-in function set by allowing user-provided logic to participate in the query evaluation pipeline. The macro/decorator pattern leverages metaprogramming to accomplish several critical tasks:

FFI Entry Point Generation

The macro/decorator generates FFI entry points that handle conversion between Arrow columnar format and native language types. This conversion is essential because the streaming engine operates on Arrow arrays internally, while user functions operate on scalar native types.

Async Execution Support

For I/O-bound UDFs (e.g., functions that call external APIs or databases), the system supports async variants. Async UDFs generate additional FFI entry points for lifecycle management:

  • __start -- initialize the async runtime
  • __send -- submit a row for processing
  • __drain_results -- collect completed results
  • __stop_runtime -- shut down the async runtime

Type Safety Across FFI

The macro system enforces type safety across the FFI boundary by:

  • Parsing the function signature at compile time
  • Mapping Rust/Python types to corresponding Arrow DataTypes
  • Generating type-checked conversion code in both directions

Dynamic Library Loading

Rust UDFs compile to dynamic libraries (.so on Linux, .dylib on macOS) that are loaded at runtime via dlopen2. This enables runtime extensibility without recompiling the engine.

Supported Languages

Language Annotation Sync Async Output Format
Rust #[udf] proc macro Yes Yes Dynamic library (.so/.dylib)
Python @udf decorator Yes Yes Python module (interpreted)

Usage Pattern

Rust UDF

#[udf]
fn my_udf(x: i64, y: String) -> Option<String> {
    Some(format!("{}: {}", y, x))
}

Python UDF

@udf
def my_udf(x: int, y: str) -> str:
    return f"{y}: {x}"

Design Considerations

  • Nullability: Returning Option<T> in Rust (or Optional[T] in Python) signals that the UDF may produce null values, which maps to nullable Arrow arrays.
  • Dependencies: Rust UDFs can declare external crate dependencies via TOML comment blocks, which are extracted and included in the generated Cargo.toml.
  • Error Handling: UDF errors are captured and surfaced through the validation and compilation pipeline rather than crashing the streaming engine.

Related Implementation

Implementation:ArroyoSystems_Arroyo_UDF_Macro

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment