Principle: Arroyo UDF Compilation
Summary
This principle covers compiling validated UDF source code into loadable dynamic libraries and registering them for use in SQL pipelines. The compilation pipeline transforms user-provided function definitions into distributable artifacts that streaming workers can load at runtime.
Core Concept
The UDF compilation pipeline consists of four stages:
- Write crate structure -- Generate a complete Rust crate containing the UDF source, dependencies, and macro annotations
- Cargo build -- Compile the crate into a dynamic library (`.so` or `.dylib`)
- Upload dylib to storage -- Persist the compiled artifact to object storage for distribution to worker nodes
- Persist metadata to database -- Record the UDF's name, signature, artifact URL, and other metadata in the system database
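The first stage above can be sketched in plain `std::fs` terms. This is an illustrative sketch, not Arroyo's actual code: the function name `write_udf_crate` and its signature are hypothetical, and the manifest layout simply shows the key idea that the crate must build as a `cdylib` and carry the user's extracted dependencies.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of stage 1: write a minimal crate for one UDF.
/// `deps` are the user-declared dependency lines destined for Cargo.toml.
fn write_udf_crate(
    dir: &Path,
    udf_name: &str,
    udf_source: &str,
    deps: &[&str],
) -> io::Result<PathBuf> {
    let crate_dir = dir.join(udf_name);
    fs::create_dir_all(crate_dir.join("src"))?;

    // Cargo.toml: cdylib so cargo emits a .so/.dylib, plus user deps.
    let mut manifest = format!(
        "[package]\nname = \"{udf_name}\"\nversion = \"0.1.0\"\nedition = \"2021\"\n\n\
         [lib]\ncrate-type = [\"cdylib\"]\n\n[dependencies]\n"
    );
    for dep in deps {
        manifest.push_str(dep);
        manifest.push('\n');
    }
    fs::write(crate_dir.join("Cargo.toml"), manifest)?;

    // lib.rs: the user's function, already carrying the #[udf] annotation.
    fs::write(crate_dir.join("src/lib.rs"), udf_source)?;
    Ok(crate_dir)
}
```

The generated directory is then handed to `cargo build` in stage 2.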
Theoretical Basis
Dynamic Code Loading
Dynamic code loading enables runtime extensibility without recompiling or redeploying the streaming engine itself. UDFs are compiled separately and loaded into worker processes via dlopen2, allowing users to extend the system's capabilities on demand.
Serialized Compilation
Compilation is mutex-guarded to ensure only one compilation runs at a time. This prevents resource contention on the compilation server and avoids race conditions when writing to the temporary crate directory.
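The mutex guard can be sketched with a static `std::sync::Mutex`. The names here (`COMPILE_LOCK`, `compile_udf`) are illustrative, not Arroyo's actual API; the point is that every compilation path acquires the same lock before touching the shared build directory.

```rust
use std::sync::Mutex;

// A unit-valued mutex serializing access to the compilation machinery.
static COMPILE_LOCK: Mutex<()> = Mutex::new(());

fn compile_udf(name: &str) -> String {
    // Only one compilation proceeds at a time; concurrent requests block here
    // until the guard is dropped at the end of the function.
    let _guard = COMPILE_LOCK.lock().unwrap();
    // ... write crate, run `cargo build --release`, upload artifact ...
    format!("compiled {name}")
}
```

Holding the guard for the full build trades throughput for simplicity: requests queue rather than contending for CPU and the temp directory.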
Content-Addressed Storage
The dynamic library path is derived from a hash of the UDF definition. This provides natural caching: if a UDF with an identical definition has already been compiled, the existing artifact can be reused without recompilation. The content-addressing scheme ensures that:
- Identical definitions produce identical artifact paths
- Modified definitions produce different paths, preventing stale artifact reuse
- Multiple versions of the same UDF can coexist in storage
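A minimal sketch of the scheme, using the standard library's `DefaultHasher` as a stand-in (Arroyo's real implementation may use a different hash function and path layout, and `artifact_path` is a hypothetical name):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a content-addressed storage key from the full UDF definition.
fn artifact_path(udf_definition: &str) -> String {
    let mut hasher = DefaultHasher::new();
    udf_definition.hash(&mut hasher);
    // Embed the digest in the key: identical definitions collide on the
    // same path (cache hit); any edit produces a fresh path.
    format!("udf_dylibs/{:016x}/lib.so", hasher.finish())
}
```

A production scheme would prefer a cryptographic hash (e.g. SHA-256) whose output is stable across processes and versions, which `DefaultHasher` does not guarantee.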
Artifact Distribution
Compiled dynamic libraries are uploaded to object storage (e.g., S3, GCS, or local filesystem) for distribution to streaming workers. When a pipeline starts, workers download the required UDF artifacts from storage and load them into their process space.
Metadata Registration
The final stage persists metadata records in the system database, linking UDF names to their compiled artifacts. This metadata includes:
- UDF name and description
- Source definition
- Language (Rust or Python)
- Dylib URL in object storage
- Creation and update timestamps
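The fields above suggest a record shaped roughly like the following. This is a sketch only: the field names, timestamp representation, and enum are illustrative, not Arroyo's exact `GlobalUdf` schema.

```rust
/// Sketch of the metadata row persisted in the final stage.
#[derive(Debug, Clone)]
struct GlobalUdf {
    name: String,
    description: Option<String>,
    definition: String,    // original source, kept for cache lookups
    language: UdfLanguage, // Rust or Python
    dylib_url: String,     // content-addressed object-storage location
    created_at: u64,       // unix timestamps; a real schema would use
    updated_at: u64,       //   proper datetime columns
}

#[derive(Debug, Clone, PartialEq)]
enum UdfLanguage {
    Rust,
    Python,
}
```

Because the definition itself is stored, the system can compare an incoming submission against existing rows and skip recompilation on an exact match.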
Compilation Pipeline Flow
- Input: Validated UDF source code (from the validation stage)
- Stage 1: Generate a Rust crate with the UDF source, `Cargo.toml` (including extracted dependencies), and the `#[udf]` macro annotation
- Stage 2: Run `cargo build --release` to produce a `.so`/`.dylib`
- Stage 3: Upload the artifact to object storage at a content-addressed path
- Stage 4: Insert or update the UDF metadata record in the database
- Output: A `GlobalUdf` record with the artifact URL and metadata
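Stage 2 reduces to invoking `cargo` inside the generated crate directory. A minimal sketch using `std::process::Command` (the helper name `build_command` is hypothetical; the real implementation also captures stderr for compiler diagnostics and handles non-zero exit codes):

```rust
use std::path::Path;
use std::process::Command;

/// Build the release-mode compilation command for a generated UDF crate.
fn build_command(crate_dir: &Path) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.arg("build")
        .arg("--release")
        // Run inside the generated crate so cargo picks up its Cargo.toml.
        .current_dir(crate_dir);
    cmd
}
```

The resulting dylib lands under `target/release/` inside the crate directory, ready for the stage-3 upload.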
Design Considerations
- Compilation latency: Full Rust compilation can take tens of seconds. The content-addressed caching mitigates this for repeated submissions of the same definition.
- Build isolation: Each UDF is compiled in its own temporary crate directory to prevent interference between concurrent compilations (further guarded by the mutex).
- Cross-platform artifacts: Dynamic libraries are platform-specific. The compilation must occur on the same platform (architecture + OS) as the target worker nodes.
- Dependency security: UDF dependencies are user-specified and pulled from crates.io. The system does not currently sandbox the compilation environment beyond mutex exclusion.