Principle:Eventual Inc Daft Hudi Reading
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lakehouse |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Hudi reading creates a lazy DataFrame scan over an Apache Hudi table, a table format that adds record-level inserts, updates, and deletes on top of columnar storage.
Description
Apache Hudi provides record-level insert, update, and delete capabilities on top of columnar storage formats such as Parquet. To read a Hudi table, the scan operator consults the table's metadata to identify the active data files, then applies the merge logic that matches the table type (Copy-on-Write or Merge-on-Read). The resulting DataFrame is lazy: no data is read until an action is triggered, which lets Daft's query optimizer apply predicate pushdown and partition pruning to the scan.
Usage
Use Hudi reading when the source data lives in an Apache Hudi table. This is typically the case for workloads that rely on Hudi's upsert and incremental processing capabilities, such as CDC (Change Data Capture) pipelines and streaming data lake ingestion.
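A minimal sketch of such a read, assuming Daft is installed with Hudi support. `daft.read_hudi` is Daft's entry point for Hudi tables; the function name `scan_hudi`, the table URI, and the column names `order_id` and `amount` are hypothetical and exist only for illustration.

```python
def scan_hudi(table_uri: str, min_amount: float):
    """Build a lazy, filtered scan over a Hudi table (nothing is read yet)."""
    import daft  # imported lazily so this sketch only needs Daft when called

    df = daft.read_hudi(table_uri)
    # "amount" and "order_id" are hypothetical column names.
    # The filter and projection are recorded in the query plan, not executed.
    return df.where(daft.col("amount") > min_amount).select("order_id", "amount")


# Usage (hypothetical path); collect() is the action that triggers execution:
# scan_hudi("s3://my-bucket/orders_hudi", 100.0).collect()
```

Because the filter is expressed on the lazy DataFrame before any action runs, the optimizer has the chance to push it toward the scan instead of filtering after a full read.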
Theoretical Basis
Apache Hudi is a table format that adds record-level mutation capabilities to data lakes. Key concepts:
- Copy-on-Write (CoW): Updates rewrite entire data files, providing fast read performance at the cost of write amplification.
- Merge-on-Read (MoR): Updates are written to delta log files and merged at read time, providing fast writes at the cost of read-time merge overhead.
- Timeline: An ordered sequence of actions (commits, compactions, cleans) that tracks the table's history.
- Upsert: Record-level insert-or-update operations identified by a record key.
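The difference between the two table types at read time can be illustrated with a toy model. This is a simplified sketch, not Hudi's actual merge implementation: it assumes log records arrive oldest to newest, that the latest write per record key wins, and that deletes are marked with a hypothetical `_deleted` flag.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def merge_on_read(
    base_records: Iterable[Record],
    log_records: Iterable[Record],
    key: str = "id",
) -> List[Record]:
    """Apply delta-log records over base-file records, keyed by record key."""
    merged = {rec[key]: rec for rec in base_records}
    for rec in log_records:  # assumed ordered oldest -> newest
        if rec.get("_deleted"):
            merged.pop(rec[key], None)  # delete removes the record
        else:
            merged[rec[key]] = rec  # upsert: insert or overwrite by key
    return list(merged.values())


# Base file contents (what a Copy-on-Write read would return as-is).
base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
# Delta log: update id=2, insert id=3, delete id=1.
log = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}, {"id": 1, "_deleted": True}]

cow_view = list(base)                 # CoW: base files already hold the merged state
mor_view = merge_on_read(base, log)   # MoR: merge happens while reading
```

In a Copy-on-Write table the writer has already paid the merge cost by rewriting the file, so the reader returns base records directly. In a Merge-on-Read table the reader performs the merge shown here on every scan until compaction folds the log into a new base file.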
Reading a Hudi table proceeds in five steps:
1. Resolve the Hudi table from the provided URI.
2. Read the Hudi timeline to determine the latest committed state.
3. Identify the active data files (and delta logs for MoR tables).
4. Create a lazy scan operator with the appropriate merge strategy.
5. Execute the scan, with predicate pushdown applied, when an action is triggered.
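The steps above can be sketched with a small in-memory model. This is an illustrative simplification under stated assumptions, not Daft's or Hudi's implementation: `Instant`, `BaseFile`, and all function names are hypothetical, timestamps are plain sortable strings, and file contents are stand-in tuples of rows rather than Parquet data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Instant:
    timestamp: str  # sortable commit time
    action: str     # e.g. "commit", "deltacommit", "compaction", "clean"
    state: str      # "REQUESTED", "INFLIGHT", or "COMPLETED"


@dataclass(frozen=True)
class BaseFile:
    file_group: str
    partition: str
    commit_time: str
    rows: tuple     # stand-in for the file's records


def latest_completed_commit(timeline: Iterable[Instant]) -> Optional[str]:
    """Step 2: the newest completed write defines the readable state."""
    done = [i.timestamp for i in timeline
            if i.state == "COMPLETED" and i.action in {"commit", "deltacommit"}]
    return max(done) if done else None


def active_files(files: Iterable[BaseFile], as_of: str) -> List[BaseFile]:
    """Step 3: keep the newest committed file per (partition, file group)."""
    latest: Dict[Tuple[str, str], BaseFile] = {}
    for f in files:
        if f.commit_time > as_of:
            continue  # written by a commit that is not yet visible
        slot = (f.partition, f.file_group)
        if slot not in latest or f.commit_time > latest[slot].commit_time:
            latest[slot] = f
    return list(latest.values())


def lazy_scan(
    files: Iterable[BaseFile],
    partition_filter: Optional[Callable[[str], bool]] = None,
    row_filter: Optional[Callable[[dict], bool]] = None,
) -> Iterator[dict]:
    """Steps 4-5: a generator, so no file is touched until it is iterated."""
    for f in files:
        if partition_filter is not None and not partition_filter(f.partition):
            continue  # partition pruning: skip the whole file
        for row in f.rows:
            if row_filter is None or row_filter(row):
                yield row  # predicate applied while scanning


timeline = [
    Instant("001", "commit", "COMPLETED"),
    Instant("002", "commit", "COMPLETED"),
    Instant("003", "commit", "INFLIGHT"),  # not yet committed
]
files = [
    BaseFile("fg1", "dt=2024-01-01", "001", ({"id": 1, "v": 10},)),
    BaseFile("fg1", "dt=2024-01-01", "002", ({"id": 1, "v": 11},)),
    BaseFile("fg2", "dt=2024-01-02", "001", ({"id": 2, "v": 20},)),
    BaseFile("fg2", "dt=2024-01-02", "003", ({"id": 2, "v": 99},)),
]

as_of = latest_completed_commit(timeline)
active = active_files(files, as_of)
scan = lazy_scan(active,
                 partition_filter=lambda p: p == "dt=2024-01-01",
                 row_filter=lambda r: r["v"] > 5)
rows = list(scan)  # the "action" that triggers execution
```

The in-flight commit `003` is ignored, the older version of file group `fg1` is superseded, and the partition filter skips `fg2` without reading its rows. Step 1 (resolving the table from a URI) is omitted here because the model keeps everything in memory.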