Principle:Eventual Inc Daft Batch UDF
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, User_Defined_Functions |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for applying custom Python functions to batches of rows using columnar Series input.
Description
Batch UDFs receive daft.Series objects (columnar batches) rather than individual values, enabling efficient vectorized operations using libraries like NumPy, PyArrow, or pandas. The return type must be explicitly specified via the return_dtype parameter. Batch functions can return a daft.Series, list, numpy.ndarray, or pyarrow.Array. Scalar arguments are passed through without modification, allowing mixed Series and scalar inputs.
Usage
Use batch UDFs when you need vectorized batch processing for performance-critical custom operations. Batch processing amortizes per-row overhead and leverages columnar computation libraries for significantly higher throughput compared to row-wise UDFs.
Theoretical Basis
Batch UDFs implement a vectorized map operation applying f(Series) -> Series across partitioned columnar data. By operating on columnar batches rather than individual rows, the per-row function call overhead is amortized, and the underlying computation can exploit SIMD instructions, cache locality, and vectorized library kernels.
for each partition P in DataFrame:
for each batch B in P (sized by batch_size):
output_batch = f(B.col1_series, B.col2_series, ...)
append output_batch to output