Principle:Eventual Inc Daft Hudi Reading

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Data_Lakehouse
Last Updated: 2026-02-08 00:00 GMT

Overview

Hudi reading is the technique of creating a lazy DataFrame scan over an Apache Hudi table, a table format that supports record-level inserts, updates, and deletes on columnar storage.

Description

Hudi reading returns a lazy DataFrame backed by a scan of an Apache Hudi table; no data files are read until an action is triggered. Apache Hudi provides record-level insert, update, and delete capabilities on top of columnar storage formats such as Parquet. The scan operator reads the table's metadata to identify the active data files and applies the merge logic that matches the table type: Copy-on-Write (CoW) or Merge-on-Read (MoR). Because the DataFrame is lazy, Daft's query optimizer can apply predicate pushdown and partition pruning before any data is read.

Usage

Use Hudi reading when the data you need lives in an Apache Hudi table. This is typical of workloads that rely on Hudi's upsert and incremental processing capabilities, such as CDC (Change Data Capture) pipelines and streaming data lake ingestion.
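
As a minimal sketch of what this looks like in practice, the function below builds a lazy query with `daft.read_hudi`. The function name `scan_hudi_table` and the column names (`record_id`, `event_date`, `amount`) are assumptions made for illustration and are not taken from any real table.

```python
def scan_hudi_table(table_uri: str):
    """Build a lazy Daft query over a Hudi table; nothing is executed here."""
    # Imported inside the function so the sketch can be loaded without Daft installed.
    import daft

    # Returns a lazy DataFrame backed by a Hudi scan.
    df = daft.read_hudi(table_uri)

    # Hypothetical columns: the filter and projection are only recorded in the
    # query plan, which leaves the optimizer free to push them into the scan.
    df = df.where(daft.col("event_date") >= "2024-01-01")
    return df.select("record_id", "event_date", "amount")
```

Calling `scan_hudi_table(...)` with the URI of a table returns a DataFrame that can be inspected with `df.explain(show_all=True)` and executed with an action such as `df.collect()`.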

Theoretical Basis

Apache Hudi is a table format that adds record-level mutation capabilities to data lakes. Key concepts:

  • Copy-on-Write (CoW): Updates rewrite entire data files, providing fast read performance at the cost of write amplification.
  • Merge-on-Read (MoR): Updates are written to delta log files and merged at read time, providing fast writes at the cost of read-time merge overhead.
  • Timeline: An ordered sequence of actions (commits, compactions, cleans) that tracks the table's history.
  • Upsert: Record-level insert-or-update operations identified by a record key.
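
The difference between the two table types at read time can be illustrated with a small, library-free sketch. It models records as dictionaries and assumes a hypothetical `_deleted` marker for delete entries; this is a simplified model of the idea, not Hudi's actual log format or Daft's implementation.

```python
from typing import Dict, List

Record = Dict[str, object]


def merge_on_read(base: List[Record], log: List[Record], key: str = "id") -> List[Record]:
    """Apply delta-log entries on top of base-file records, matched by record key."""
    merged: Dict[object, Record] = {r[key]: r for r in base}
    for entry in log:  # log entries are assumed to arrive in commit order
        if entry.get("_deleted"):
            merged.pop(entry[key], None)  # delete marker removes the record
        else:
            merged[entry[key]] = entry    # upsert: insert a new key or replace the old row
    return sorted(merged.values(), key=lambda r: r[key])


base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
log = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}, {"id": 1, "_deleted": True}]

# MoR: the reader merges base files with the delta log.
snapshot = merge_on_read(base, log)

# CoW: files are already rewritten at write time, so there is no log to merge.
cow_view = merge_on_read(base, [])
```

In this model a Copy-on-Write read is the degenerate case with an empty log, which is why it is cheaper to read and more expensive to write.
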

A Hudi read proceeds through the following steps:

1. Resolve the Hudi table from the provided URI.
2. Read the Hudi timeline to determine the latest committed state.
3. Identify the active data files (and the delta logs, for MoR tables).
4. Create a lazy scan operator with the merge strategy appropriate to the table type.
5. Execute the scan, with predicates pushed down, once an action triggers evaluation.
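
Steps 2 and 3 can be sketched with a simplified in-memory model. The `Instant` and `DataFile` classes, the shortened timestamps, and the file names below are all hypothetical; they only illustrate how a reader might pick the latest completed commit and then keep the newest file per file group as of that commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Instant:
    timestamp: str  # sortable timestamp string (shortened here for readability)
    action: str     # e.g. "commit", "deltacommit", "clean"
    state: str      # "REQUESTED", "INFLIGHT" or "COMPLETED"


@dataclass(frozen=True)
class DataFile:
    file_group: str
    commit_time: str
    path: str


def latest_completed(timeline: List[Instant]) -> Optional[Instant]:
    """Return the newest completed write instant, ignoring in-flight work."""
    writes = [i for i in timeline
              if i.state == "COMPLETED" and i.action in {"commit", "deltacommit"}]
    return max(writes, key=lambda i: i.timestamp, default=None)


def active_files(files: List[DataFile], as_of: str) -> List[DataFile]:
    """Keep, per file group, the newest file written at or before `as_of`."""
    newest: Dict[str, DataFile] = {}
    for f in files:
        if f.commit_time > as_of:
            continue  # written by a commit the reader must not see yet
        current = newest.get(f.file_group)
        if current is None or f.commit_time > current.commit_time:
            newest[f.file_group] = f
    return sorted(newest.values(), key=lambda f: f.path)


timeline = [
    Instant("20240101", "commit", "COMPLETED"),
    Instant("20240102", "commit", "COMPLETED"),
    Instant("20240103", "commit", "INFLIGHT"),
]
files = [
    DataFile("fg1", "20240101", "fg1_20240101.parquet"),
    DataFile("fg1", "20240102", "fg1_20240102.parquet"),
    DataFile("fg2", "20240101", "fg2_20240101.parquet"),
    DataFile("fg1", "20240103", "fg1_20240103.parquet"),
]

latest = latest_completed(timeline)
selected = active_files(files, latest.timestamp)
```

The in-flight commit and the file it wrote are skipped, so the reader sees a consistent view of the table as of the last completed commit.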

Related Pages

Implemented By
