Principle:Eventual Inc Daft Hudi Reading

Knowledge Sources	Daft, Daft Docs
Domains	Data_Engineering, Data_Lakehouse
Last Updated	2026-02-08 00:00 GMT

Overview

Hudi reading creates a lazy DataFrame scan over an Apache Hudi table, a table format that supports record-level inserts, updates, and deletes on columnar storage.

Description

Hudi reading creates a lazy DataFrame scan of an Apache Hudi table. Apache Hudi provides record-level insert, update, and delete capabilities on top of columnar storage formats like Parquet. The scan operator reads the Hudi table's metadata to identify active data files and applies the correct merge logic depending on the table type (Copy-on-Write or Merge-on-Read). The resulting DataFrame is lazy and supports predicate pushdown and partition pruning through Daft's query optimizer.

Usage

Use Hudi reading to load an Apache Hudi table into a Daft DataFrame. It suits workloads where data is managed with Hudi's upsert and incremental processing capabilities, such as CDC (Change Data Capture) pipelines and streaming data lake ingestion.
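A minimal sketch of this usage follows, assuming Daft is installed and exposes `daft.read_hudi(table_uri)`. The function name `build_hudi_query` and the column names `region`, `user_id`, and `amount` are hypothetical and only serve the illustration.

```python
def build_hudi_query(table_uri: str):
    """Build a lazy query over a Hudi table; no data files are read here."""
    # Imported inside the function so the sketch loads even without Daft installed.
    import daft

    df = daft.read_hudi(table_uri)  # lazy scan over the Hudi table

    # "region", "user_id", and "amount" are hypothetical column names.
    return (
        df.where(daft.col("region") == "eu")   # candidate for partition pruning
        .where(daft.col("amount") > 100)       # candidate for predicate pushdown
        .select("user_id", "amount")
    )
```

Nothing runs until an action is triggered. For example, calling `.show()` or `.collect()` on the returned DataFrame would execute the scan, and `.explain(show_all=True)` would print the plans so you can check which filters reached the scan. The table URI passed in (for instance an `s3://` path) is whatever location your table actually lives at.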

Theoretical Basis

Apache Hudi is a table format that adds record-level mutation capabilities to data lakes. Key concepts:

Copy-on-Write (CoW): Updates rewrite entire data files, providing fast read performance at the cost of write amplification.
Merge-on-Read (MoR): Updates are written to delta log files and merged at read time, providing fast writes at the cost of read-time merge overhead.
Timeline: An ordered sequence of actions (commits, compactions, cleans) that tracks the table's history.
Upsert: Record-level insert-or-update operations identified by a record key.
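The difference between the two table types can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not Hudi's or Daft's implementation: files are modeled as dictionaries keyed by record key, and all names (`apply_log`, `cow_write`, `mor_read`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]
# A log entry is (record_key, new_record); new_record=None marks a delete.
LogEntry = Tuple[str, Optional[Record]]


def apply_log(base: Dict[str, Record], log: List[LogEntry]) -> Dict[str, Record]:
    """Return a new snapshot with log entries applied in order (latest wins)."""
    merged = dict(base)
    for key, record in log:
        if record is None:
            merged.pop(key, None)  # delete by record key
        else:
            merged[key] = record   # upsert: insert or overwrite by record key
    return merged


def cow_write(base_file: Dict[str, Record], changes: List[LogEntry]) -> Dict[str, Record]:
    """Copy-on-Write: the writer produces a rewritten base file."""
    return apply_log(base_file, changes)


def mor_read(base_file: Dict[str, Record], delta_log: List[LogEntry]) -> Dict[str, Record]:
    """Merge-on-Read: the base file is untouched; each read merges the log."""
    return apply_log(base_file, delta_log)


base = {"k1": {"id": "k1", "v": 1}, "k2": {"id": "k2", "v": 2}}
changes = [("k2", {"id": "k2", "v": 20}), ("k3", {"id": "k3", "v": 3}), ("k1", None)]

cow_snapshot = cow_write(base, changes)  # cost paid once, at write time
mor_snapshot = mor_read(base, changes)   # cost paid on every read
```

Both paths yield the same logical snapshot. What differs is when the merge cost is paid: once by the writer for CoW, or on each read for MoR until compaction folds the log into new base files.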

Given these concepts, a Hudi read proceeds in five steps:

1. Resolve the Hudi table from the provided URI
2. Read the Hudi timeline to determine the latest committed state
3. Identify active data files (and delta logs for MoR tables)
4. Create a lazy scan operator with appropriate merge strategy
5. Execute the scan with predicate pushdown when an action is triggered
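The steps above can be sketched with a small in-memory model. This is a simplified illustration under stated assumptions, not Daft's scan operator: the `Instant` and `DataFile` classes, the helper names, and the sample data are all hypothetical, and delta logs for MoR tables are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Set


@dataclass
class Instant:
    timestamp: str
    action: str       # e.g. "commit", "clean"
    completed: bool   # False while the action is still in flight


@dataclass
class DataFile:
    file_group: str
    commit_time: str
    rows: List[dict]


def completed_commits(timeline: List[Instant]) -> Set[str]:
    """Step 2: keep only commits that finished successfully."""
    return {i.timestamp for i in timeline if i.action == "commit" and i.completed}


def active_files(files: List[DataFile], commits: Set[str]) -> List[DataFile]:
    """Step 3: per file group, pick the newest file written by a completed commit."""
    latest: Dict[str, DataFile] = {}
    for f in files:
        if f.commit_time not in commits:
            continue  # skip files from uncommitted writes
        current = latest.get(f.file_group)
        if current is None or f.commit_time > current.commit_time:
            latest[f.file_group] = f
    return list(latest.values())


def lazy_scan(files: List[DataFile], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Steps 4 and 5: a generator, so rows are read and filtered only when iterated."""
    for f in files:
        for row in f.rows:
            if predicate(row):
                yield row


timeline = [
    Instant("001", "commit", True),
    Instant("002", "commit", True),
    Instant("003", "commit", False),  # in flight, must be ignored
]
files = [
    DataFile("fg-a", "001", [{"id": 1, "v": 10}]),
    DataFile("fg-a", "002", [{"id": 1, "v": 11}]),
    DataFile("fg-b", "001", [{"id": 2, "v": 5}]),
    DataFile("fg-b", "003", [{"id": 2, "v": 99}]),
]

snapshot = active_files(files, completed_commits(timeline))
scan = lazy_scan(snapshot, lambda row: row["v"] > 6)  # nothing is read yet
result = list(scan)                                   # the "action" that triggers the scan
```

Here the filter is applied inside the scan rather than after all rows are materialized, which is the idea behind predicate pushdown. A real engine can go further and skip whole files using partition values or file statistics.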

Related Pages

Implemented By

Implementation:Eventual_Inc_Daft_Read_Hudi
