
Principle:ArroyoSystems Arroyo UDF Validation

From Leeroopedia



Summary

This principle covers how the Arroyo streaming engine validates UDF source code before compilation. Validation parses user-provided function definitions, extracts metadata (signatures, types, dependencies), and runs a lightweight compilation check so that errors surface quickly.

Core Concept

UDF validation is a multi-stage pipeline that transforms raw user source code into a structured representation suitable for compilation. The pipeline operates on the source text without producing final compiled artifacts, enabling rapid iteration during UDF development.

The validation stages are:

  1. Lexical and syntactic analysis -- Parse the source code to extract function signatures
  2. Type mapping -- Map user-facing types to Arrow DataTypes, rejecting types that have no mapping
  3. Async detection -- Determine whether the function is synchronous or asynchronous
  4. Dependency extraction -- Parse TOML dependency declarations from comment blocks
  5. Compilation check -- Run cargo check to verify the code compiles without producing artifacts

Theoretical Basis

Static Validation for Early Error Detection

Validating user-provided code statically, before full compilation, lets the system catch errors at the parsing and type-mapping stages and report them without incurring the cost of a full compilation cycle.

Parsing Strategy

The system uses language-specific parsing tools:

Language | Parser      | Purpose
Rust     | syn crate   | Full syntactic analysis of function items, extracting parameter names, types, return type, and async markers
Python   | AST parsing | Extract function signatures and type annotations from decorated functions
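To make the Python row concrete, here is a minimal sketch that uses Python's standard ast module to pull a signature out of a decorated function. The decorator name udf, the helper extract_signature, and the UdfSignature record are hypothetical names chosen for illustration; they are not taken from Arroyo's code.

```python
import ast
from dataclasses import dataclass


@dataclass
class UdfSignature:
    """Hypothetical record of what the parser extracts."""
    name: str
    params: list  # list of (parameter name, annotation source text)
    return_type: str
    is_async: bool


def _decorator_name(node: ast.expr) -> str:
    # Handles `@udf`, `@udf(...)`, and `@module.udf`.
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return ""


def extract_signature(source: str, decorator: str = "udf") -> UdfSignature:
    # ast.parse raises SyntaxError (with a line number) on malformed input.
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not any(_decorator_name(d) == decorator for d in node.decorator_list):
            continue
        params = []
        for arg in node.args.args:
            if arg.annotation is None:
                raise ValueError(
                    f"line {arg.lineno}: parameter '{arg.arg}' has no type annotation"
                )
            params.append((arg.arg, ast.unparse(arg.annotation)))
        if node.returns is None:
            raise ValueError(
                f"line {node.lineno}: function '{node.name}' has no return annotation"
            )
        return UdfSignature(
            name=node.name,
            params=params,
            return_type=ast.unparse(node.returns),
            is_async=isinstance(node, ast.AsyncFunctionDef),
        )
    raise ValueError(f"no function decorated with @{decorator} found")
```

Because the source is only parsed and never executed, the decorator and any imported names do not need to exist at validation time. The sketch relies on ast.unparse, which requires Python 3.9 or later.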

Type Mapping

User-facing types must be mapped to Arrow DataTypes for integration with the columnar execution engine:

Rust Type | Python Type | Arrow DataType
i64       | int         | Int64
f64       | float       | Float64
String    | str         | Utf8
bool      | bool        | Boolean
Option<T> | Optional[T] | Nullable variant of mapped type
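The table can be read as a lookup plus one wrapper rule for nullability. The sketch below encodes exactly the rows listed above; Arrow types are represented as plain strings rather than real Arrow DataType values, and the names TYPE_TABLE and map_type are hypothetical.

```python
# Rows from the table above, keyed by UDF language.
TYPE_TABLE = {
    "rust": {"i64": "Int64", "f64": "Float64", "String": "Utf8", "bool": "Boolean"},
    "python": {"int": "Int64", "float": "Float64", "str": "Utf8", "bool": "Boolean"},
}

# The wrapper syntax that marks a value as nullable in each language.
OPTIONAL_WRAPPERS = {"rust": ("Option<", ">"), "python": ("Optional[", "]")}


def map_type(language: str, type_text: str) -> tuple:
    """Return (arrow_type_name, nullable) or raise ValueError if unmapped."""
    table = TYPE_TABLE[language]
    prefix, suffix = OPTIONAL_WRAPPERS[language]
    text = type_text.strip()
    nullable = False
    if text.startswith(prefix) and text.endswith(suffix):
        nullable = True
        text = text[len(prefix):-len(suffix)].strip()
    if text not in table:
        raise ValueError(f"unsupported {language} UDF type: {type_text!r}")
    return table[text], nullable
```

Rejecting unmapped types at this stage means a user sees an "unsupported type" message directly, rather than a less obvious compiler error later in the pipeline.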

Dependency Resolution

Rust UDFs can declare external crate dependencies via TOML blocks embedded in comments at the top of the source file. The validation pipeline extracts these using TOML parsing and includes them in the generated Cargo.toml for the UDF crate.

Lightweight Compilation Check

The final validation step runs cargo check, which type-checks and borrow-checks the crate but skips code generation, so no final compiled artifacts are produced. This is typically much faster than a full cargo build and catches semantic errors that syntactic parsing alone cannot detect.
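A rough sketch of this step, written as a Python driver script, might look like the following. The crate name udf_check, the single-file src/lib.rs layout, and both function names are assumptions; the real generated crate likely contains additional scaffolding. For brevity, only plain version-string dependencies are handled.

```python
import subprocess
import tempfile
from pathlib import Path


def write_check_crate(root: Path, udf_source: str, dependencies: dict) -> Path:
    """Lay out a throwaway library crate containing the UDF source."""
    (root / "src").mkdir(parents=True, exist_ok=True)
    # Simplification: every dependency is a plain `name = "version"` entry.
    dep_lines = "\n".join(
        f'{name} = "{version}"' for name, version in sorted(dependencies.items())
    )
    manifest = (
        "[package]\n"
        'name = "udf_check"\n'
        'version = "0.1.0"\n'
        'edition = "2021"\n'
        "\n"
        "[dependencies]\n"
        f"{dep_lines}\n"
    )
    (root / "Cargo.toml").write_text(manifest)
    (root / "src" / "lib.rs").write_text(udf_source)
    return root


def cargo_check(udf_source: str, dependencies: dict) -> tuple:
    """Return (ok, diagnostics). Requires `cargo` to be on PATH."""
    with tempfile.TemporaryDirectory() as tmp:
        crate = write_check_crate(Path(tmp), udf_source, dependencies)
        result = subprocess.run(
            ["cargo", "check", "--message-format=short"],
            cwd=crate,
            capture_output=True,
            text=True,
        )
        # cargo writes compiler diagnostics to stderr.
        return result.returncode == 0, result.stderr
```

Using a temporary directory keeps each check isolated, at the cost of a cold dependency cache; a longer-lived service would probably reuse a shared target directory to keep checks fast.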

Validation Pipeline Flow

The validation pipeline proceeds as follows:

  • Input: Raw UDF source code as a string
  • Stage 1: Parse source to extract function item(s)
  • Stage 2: Extract function name, parameters, return type, and async marker
  • Stage 3: Map all parameter and return types to Arrow DataTypes
  • Stage 4: Extract TOML dependency block from leading comments (if present)
  • Stage 5: Construct a temporary crate and run cargo check
  • Output: A ParsedUdfFile containing the validated function metadata, or a descriptive error
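The flow above is essentially a fail-fast chain: each stage builds on the previous stage's output, and the first failure is reported together with the stage that produced it. The sketch below shows that control flow with two toy stages for a Python UDF. The ValidationError class, run_pipeline, and the dictionary standing in for ParsedUdfFile are all hypothetical.

```python
import ast


class ValidationError(Exception):
    """Error tagged with the stage that failed and, if known, a line number."""

    def __init__(self, stage, message, line=None):
        super().__init__(message)
        self.stage = stage
        self.message = message
        self.line = line

    def __str__(self):
        where = f" (line {self.line})" if self.line is not None else ""
        return f"[{self.stage}] {self.message}{where}"


def run_pipeline(source, stages):
    """Run (name, callable) stages in order, stopping at the first failure."""
    ctx = {"source": source}  # stand-in for the ParsedUdfFile being built
    for name, stage in stages:
        try:
            stage(ctx)
        except ValidationError:
            raise
        except Exception as err:
            # SyntaxError exposes .lineno; other errors usually do not.
            raise ValidationError(name, str(err), getattr(err, "lineno", None)) from err
    return ctx


def parse_stage(ctx):
    tree = ast.parse(ctx["source"])
    funcs = [n for n in tree.body
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        raise ValueError("no function definition found")
    ctx["function"] = funcs[0]


def metadata_stage(ctx):
    fn = ctx["function"]
    ctx["name"] = fn.name
    ctx["is_async"] = isinstance(fn, ast.AsyncFunctionDef)


STAGES = [("parse", parse_stage), ("metadata", metadata_stage)]
```

Type mapping, dependency extraction, and the compilation check would slot in as further entries in STAGES. Ordering the cheap stages first means most mistakes are reported before the comparatively slow cargo check ever runs.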

Design Considerations

  • Fast feedback: Validation should complete in seconds, not minutes, to support interactive UDF development in the web UI
  • Descriptive errors: Error messages from each stage should clearly indicate what failed and where, including line numbers when available
  • Language parity: Both Rust and Python UDFs should go through equivalent validation stages, though the implementation details differ by language

Related Implementation

Implementation:ArroyoSystems_Arroyo_Validate_UDF
