Principle: Arroyo UDF Compilation
Summary
This principle covers compiling validated UDF source code into loadable dynamic libraries and registering them for use in SQL pipelines. The compilation pipeline transforms user-provided function definitions into distributable artifacts that streaming workers can load at runtime.
Core Concept
The UDF compilation pipeline consists of four stages:
- Write crate structure -- Generate a complete Rust crate containing the UDF source, dependencies, and macro annotations
- Cargo build -- Compile the crate into a dynamic library (`.so` or `.dylib`)
- Upload dylib to storage -- Persist the compiled artifact to object storage for distribution to worker nodes
- Persist metadata to database -- Record the UDF's name, signature, artifact URL, and other metadata in the system database
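The first stage above can be sketched in plain `std::fs` terms. This is an illustrative sketch, not Arroyo's actual code: the function name `write_udf_crate` and its signature are hypothetical, and the manifest layout simply shows the key idea that the crate must build as a `cdylib` and carry the user's extracted dependencies.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of stage 1: write a minimal crate for one UDF.
/// `deps` are the user-declared dependency lines destined for Cargo.toml.
fn write_udf_crate(
    dir: &Path,
    udf_name: &str,
    udf_source: &str,
    deps: &[&str],
) -> io::Result<PathBuf> {
    let crate_dir = dir.join(udf_name);
    fs::create_dir_all(crate_dir.join("src"))?;

    // Cargo.toml: cdylib so cargo emits a .so/.dylib, plus user deps.
    let mut manifest = format!(
        "[package]\nname = \"{udf_name}\"\nversion = \"0.1.0\"\nedition = \"2021\"\n\n\
         [lib]\ncrate-type = [\"cdylib\"]\n\n[dependencies]\n"
    );
    for dep in deps {
        manifest.push_str(dep);
        manifest.push('\n');
    }
    fs::write(crate_dir.join("Cargo.toml"), manifest)?;

    // lib.rs: the user's function, already carrying the #[udf] annotation.
    fs::write(crate_dir.join("src/lib.rs"), udf_source)?;
    Ok(crate_dir)
}
```

The generated directory is then handed to `cargo build` in stage 2.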
Theoretical Basis
Dynamic Code Loading
Dynamic code loading enables runtime extensibility without recompiling or redeploying the streaming engine itself. UDFs are compiled separately and loaded into worker processes via dlopen2, allowing users to extend the system's capabilities on demand.
Serialized Compilation
Compilation is mutex-guarded to ensure only one compilation runs at a time. This prevents resource contention on the compilation server and avoids race conditions when writing to the temporary crate directory.
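The mutex guard can be sketched with a static `std::sync::Mutex`. The names here (`COMPILE_LOCK`, `compile_udf`) are illustrative, not Arroyo's actual API; the point is that every compilation path acquires the same lock before touching the shared build directory.

```rust
use std::sync::Mutex;

// A unit-valued mutex serializing access to the compilation machinery.
static COMPILE_LOCK: Mutex<()> = Mutex::new(());

fn compile_udf(name: &str) -> String {
    // Only one compilation proceeds at a time; concurrent requests block here
    // until the guard is dropped at the end of the function.
    let _guard = COMPILE_LOCK.lock().unwrap();
    // ... write crate, run `cargo build --release`, upload artifact ...
    format!("compiled {name}")
}
```

Holding the guard for the full build trades throughput for simplicity: requests queue rather than contending for CPU and the temp directory.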
Content-Addressed Storage
The dynamic library path is derived from a hash of the UDF definition. This provides natural caching: if a UDF with an identical definition has already been compiled, the existing artifact can be reused without recompilation. The content-addressing scheme ensures that:
- Identical definitions produce identical artifact paths
- Modified definitions produce different paths, preventing stale artifact reuse
- Multiple versions of the same UDF can coexist in storage
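A minimal sketch of the scheme, using the standard library's `DefaultHasher` as a stand-in (Arroyo's real implementation may use a different hash function and path layout, and `artifact_path` is a hypothetical name):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a content-addressed storage key from the full UDF definition.
fn artifact_path(udf_definition: &str) -> String {
    let mut hasher = DefaultHasher::new();
    udf_definition.hash(&mut hasher);
    // Embed the digest in the key: identical definitions collide on the
    // same path (cache hit); any edit produces a fresh path.
    format!("udf_dylibs/{:016x}/lib.so", hasher.finish())
}
```

A production scheme would prefer a cryptographic hash (e.g. SHA-256) whose output is stable across processes and versions, which `DefaultHasher` does not guarantee.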
Artifact Distribution
Compiled dynamic libraries are uploaded to object storage (e.g., S3, GCS, or local filesystem) for distribution to streaming workers. When a pipeline starts, workers download the required UDF artifacts from storage and load them into their process space.
Metadata Registration
The final stage persists metadata records in the system database, linking UDF names to their compiled artifacts. This metadata includes:
- UDF name and description
- Source definition
- Language (Rust or Python)
- Dylib URL in object storage
- Creation and update timestamps
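The fields above suggest a record shaped roughly like the following. This is a sketch only: the field names, timestamp representation, and enum are illustrative, not Arroyo's exact `GlobalUdf` schema.

```rust
/// Sketch of the metadata row persisted in the final stage.
#[derive(Debug, Clone)]
struct GlobalUdf {
    name: String,
    description: Option<String>,
    definition: String,    // original source, kept for cache lookups
    language: UdfLanguage, // Rust or Python
    dylib_url: String,     // content-addressed object-storage location
    created_at: u64,       // unix timestamps; a real schema would use
    updated_at: u64,       //   proper datetime columns
}

#[derive(Debug, Clone, PartialEq)]
enum UdfLanguage {
    Rust,
    Python,
}
```

Because the definition itself is stored, the system can compare an incoming submission against existing rows and skip recompilation on an exact match.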
Compilation Pipeline Flow
- Input: Validated UDF source code (from the validation stage)
- Stage 1: Generate a Rust crate with the UDF source, `Cargo.toml` (including extracted dependencies), and the `#[udf]` macro annotation
- Stage 2: Run `cargo build --release` to produce a `.so`/`.dylib`
- Stage 3: Upload the artifact to object storage at a content-addressed path
- Stage 4: Insert or update the UDF metadata record in the database
- Output: A `GlobalUdf` record with the artifact URL and metadata
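Stage 2 reduces to invoking `cargo` inside the generated crate directory. A minimal sketch using `std::process::Command` (the helper name `build_command` is hypothetical; the real implementation also captures stderr for compiler diagnostics and handles non-zero exit codes):

```rust
use std::path::Path;
use std::process::Command;

/// Build the release-mode compilation command for a generated UDF crate.
fn build_command(crate_dir: &Path) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.arg("build")
        .arg("--release")
        // Run inside the generated crate so cargo picks up its Cargo.toml.
        .current_dir(crate_dir);
    cmd
}
```

The resulting dylib lands under `target/release/` inside the crate directory, ready for the stage-3 upload.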
Design Considerations
- Compilation latency: Full Rust compilation can take tens of seconds. The content-addressed caching mitigates this for repeated submissions of the same definition.
- Build isolation: Each UDF is compiled in its own temporary crate directory to prevent interference between concurrent compilations (further guarded by the mutex).
- Cross-platform artifacts: Dynamic libraries are platform-specific. The compilation must occur on the same platform (architecture + OS) as the target worker nodes.
- Dependency security: UDF dependencies are user-specified and pulled from crates.io. The system does not currently sandbox the compilation environment beyond mutex exclusion.