
Principle:ArroyoSystems Arroyo UDF Compilation

From Leeroopedia


Summary

This principle covers compiling validated UDF source code into loadable dynamic libraries and registering them for use in SQL pipelines. The compilation pipeline transforms user-provided function definitions into distributable artifacts that streaming workers can load at runtime.

Core Concept

The UDF compilation pipeline consists of four stages:

  1. Write crate structure -- Generate a complete Rust crate containing the UDF source, dependencies, and macro annotations
  2. Cargo build -- Compile the crate into a dynamic library (.so or .dylib)
  3. Upload dylib to storage -- Persist the compiled artifact to object storage for distribution to worker nodes
  4. Persist metadata to database -- Record the UDF's name, signature, artifact URL, and other metadata in the system database
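
The four stages can be sketched as a chain of functions. This is an illustrative skeleton only: the function names, types, and paths below are assumptions for the sketch, not Arroyo's actual API, and each stage body is a stand-in for the real work.

```rust
// Illustrative skeleton of the four-stage pipeline. Names and types are
// assumptions; each stage body is a stand-in for the real work.

struct CompiledUdf {
    name: String,
    dylib_url: String,
}

fn write_crate(udf_source: &str) -> String {
    // Stage 1: lay out a Rust crate for the UDF; sketched as returning a dir path.
    format!("/tmp/udf_crate_{}", udf_source.len())
}

fn cargo_build(crate_dir: &str) -> String {
    // Stage 2: run `cargo build --release`; sketched as returning the dylib path.
    format!("{}/target/release/libudf.so", crate_dir)
}

fn upload_dylib(dylib_path: &str) -> String {
    // Stage 3: upload to object storage; sketched as returning the artifact URL.
    format!("s3://artifacts/{}", dylib_path.rsplit('/').next().unwrap())
}

fn persist_metadata(name: &str, dylib_url: &str) -> CompiledUdf {
    // Stage 4: record the name and artifact URL in the system database.
    CompiledUdf { name: name.to_string(), dylib_url: dylib_url.to_string() }
}

fn compile_udf(name: &str, source: &str) -> CompiledUdf {
    let crate_dir = write_crate(source);
    let dylib = cargo_build(&crate_dir);
    let url = upload_dylib(&dylib);
    persist_metadata(name, &url)
}
```

Each stage consumes the previous stage's output, so a failure at any point aborts the pipeline before any metadata is persisted.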

Theoretical Basis

Dynamic Code Loading

Dynamic code loading enables runtime extensibility without recompiling or redeploying the streaming engine itself. UDFs are compiled separately and loaded into worker processes via dlopen2, allowing users to extend the system's capabilities on demand.
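
Demonstrating dlopen2 itself requires the external crate, but the effect of loading is that a worker ends up with a name-to-function-pointer registry. A std-only analogy (the registry and UDF names here are made up for illustration):

```rust
use std::collections::HashMap;

// Std-only analogy for runtime-loaded UDFs: a registry mapping UDF names
// to function pointers. In the real system the pointers come from symbols
// resolved via dlopen2 against a downloaded dylib; here they are ordinary
// non-capturing closures coerced to fn pointers.
type ScalarUdf = fn(i64) -> i64;

fn load_registry() -> HashMap<String, ScalarUdf> {
    let mut registry: HashMap<String, ScalarUdf> = HashMap::new();
    registry.insert("double".to_string(), |x| x * 2);
    registry.insert("negate".to_string(), |x| -x);
    registry
}
```

Once registered, the SQL planner can resolve a UDF call by name and invoke it like any built-in function.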

Serialized Compilation

Compilation is mutex-guarded so that only one build runs at a time. This prevents resource contention on the compilation server and avoids race conditions when writing to the temporary crate directory. Note that the mutex serializes builds; it is not a security sandbox (see Design Considerations below).
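
The guard can be sketched with std's Mutex. In this toy version (all names illustrative), the Vec stands in for the shared temporary crate directory that concurrent builds would otherwise clobber:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Sketch of the compile lock: a shared mutex ensures at most one "build"
// mutates the shared crate directory at a time. The Vec stands in for
// the temporary crate directory.
fn run_serialized_builds(n: usize) -> usize {
    let lock = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for i in 0..n {
        let lock = Arc::clone(&lock);
        handles.push(thread::spawn(move || {
            // Acquire the compile lock before touching the build dir.
            let mut dir = lock.lock().unwrap();
            dir.push(i); // stand-in for writing crate files + running cargo
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let dir = lock.lock().unwrap();
    dir.len()
}
```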

Content-Addressed Storage

The dynamic library path is derived from a hash of the UDF definition. This provides natural caching: if a UDF with an identical definition has already been compiled, the existing artifact can be reused without recompilation. The content-addressing scheme ensures that:

  • Identical definitions produce identical artifact paths
  • Modified definitions produce different paths, preventing stale artifact reuse
  • Multiple versions of the same UDF can coexist in storage
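
A minimal sketch of such a path scheme, assuming the artifact key is a hex digest of the definition (the exact hash function Arroyo uses is not specified here, so std's DefaultHasher stands in, and the path layout is illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Content-addressed artifact paths: the path is a pure function of the
// UDF definition, so identical definitions hit the cache and any edit
// produces a fresh path.
fn artifact_path(definition: &str) -> String {
    let mut hasher = DefaultHasher::new();
    definition.hash(&mut hasher);
    format!("udfs/{:016x}/lib.so", hasher.finish())
}
```

Before compiling, the server can check storage for an object at this path and skip the build entirely on a hit.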

Artifact Distribution

Compiled dynamic libraries are uploaded to object storage (e.g., S3, GCS, or local filesystem) for distribution to streaming workers. When a pipeline starts, workers download the required UDF artifacts from storage and load them into their process space.
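
The worker-side fetch amounts to download-if-absent into a local cache. In this sketch a local directory stands in for the object store, and the function name and cache layout are assumptions:

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Sketch of worker-side artifact fetch: copy the dylib from "object
// storage" (a local directory here) into a per-worker cache, skipping
// the copy when the artifact is already cached.
fn fetch_artifact(store: &Path, cache: &Path, key: &str) -> std::io::Result<PathBuf> {
    let cached = cache.join(key);
    if !cached.exists() {
        fs::create_dir_all(cached.parent().unwrap())?;
        fs::copy(store.join(key), &cached)?; // real code: S3/GCS GET
    }
    Ok(cached)
}
```

Because paths are content-addressed, a cached artifact never needs invalidation: a changed definition arrives under a different key.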

Metadata Registration

The final stage persists metadata records in the system database, linking UDF names to their compiled artifacts. This metadata includes:

  • UDF name and description
  • Source definition
  • Language (Rust or Python)
  • Dylib URL in object storage
  • Creation and update timestamps
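
The record shape implied by the list above might look as follows; the field names and types are assumptions modeled on that list, not Arroyo's exact schema:

```rust
// Illustrative shape of a UDF metadata record; field names and types are
// assumptions modeled on the list above.
#[derive(Debug, Clone, PartialEq)]
enum UdfLanguage {
    Rust,
    Python,
}

#[derive(Debug, Clone)]
struct GlobalUdf {
    name: String,
    description: Option<String>,
    definition: String,
    language: UdfLanguage,
    dylib_url: String,
    created_at: u64, // unix seconds; the real schema likely uses DB timestamps
    updated_at: u64,
}
```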

Compilation Pipeline Flow

  • Input: Validated UDF source code (from the validation stage)
  • Stage 1: Generate a Rust crate with the UDF source, Cargo.toml (including extracted dependencies), and the #[udf] macro annotation
  • Stage 2: Run cargo build --release to produce a .so/.dylib
  • Stage 3: Upload the artifact to object storage at a content-addressed path
  • Stage 4: Insert or update the UDF metadata record in the database
  • Output: A GlobalUdf record with the artifact URL and metadata
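
The crate generated in Stage 1 might have a Cargo.toml shaped like this; the package name and comments are illustrative, not Arroyo's exact output, though `crate-type = ["cdylib"]` is the standard way to make cargo emit a .so/.dylib:

```toml
# Illustrative Cargo.toml for the generated UDF crate.
[package]
name = "udf"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]  # produce a .so/.dylib rather than an rlib

[dependencies]
# dependencies extracted from the user's UDF definition go here
```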

Design Considerations

  • Compilation latency: Full Rust compilation can take tens of seconds. The content-addressed caching mitigates this for repeated submissions of the same definition.
  • Build isolation: Each UDF is compiled in its own temporary crate directory to prevent interference between concurrent compilations (further guarded by the mutex).
  • Cross-platform artifacts: Dynamic libraries are platform-specific. The compilation must occur on the same platform (architecture + OS) as the target worker nodes.
  • Dependency security: UDF dependencies are user-specified and pulled from crates.io. The system does not currently sandbox the compilation environment beyond mutex exclusion.

Related Implementation

Implementation:ArroyoSystems_Arroyo_Compile_UDF
