Implementation:Eventual Inc Daft Read Hudi

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Data_Lakehouse
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool, provided by the Daft library, for reading Apache Hudi tables into a lazy, distributed DataFrame.

Description

The read_hudi function creates a lazy DataFrame scan of an Apache Hudi table. It accepts a table URI (including remote object stores like s3:// and gs://) and an optional IO configuration. The function creates a HudiScanOperator that handles reading the Hudi table's metadata and data files. Multithreaded IO is automatically disabled when running on the Ray runner to reduce resource contention. The resulting DataFrame is lazy and supports predicate pushdown and partition pruning through Daft's query optimizer.
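The lazy behaviour described above can be sketched as follows. This is a minimal illustration, not code from the Daft repository: the helper name build_filtered_scan, the min_foo parameter, and the column name "foo" are hypothetical. Daft is imported inside the function so the definition itself does not require Daft to be installed.

```python
def build_filtered_scan(table_uri, min_foo=5, io_config=None):
    """Build a lazy, filtered scan of a Hudi table (hypothetical helper).

    Assumes the table has a numeric column named "foo"; that column name
    is an illustration only.
    """
    import daft  # imported lazily; requires the Daft package at call time

    # read_hudi returns a lazy DataFrame: data files are not read here.
    df = daft.read_hudi(table_uri, io_config=io_config)

    # The filter and projection are only recorded in the query plan.
    # Daft's optimizer may push them down into the Hudi scan.
    df = df.where(daft.col("foo") > min_foo).select("foo")
    return df
```

Calling build_filtered_scan("some-table-uri").explain(show_all=True) prints the query plans without reading any data files; show() or collect() triggers execution. Whether the filter appears as a pushdown in the scan node depends on the Daft version and is not verified here.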

Usage

Import and use this function when you need to read data from an Apache Hudi table into a Daft DataFrame for analysis or further processing.

Code Reference

Source Location

  • Repository: Daft
  • File: daft/io/hudi/_hudi.py
  • Lines: L13-52

Signature

def read_hudi(
    table_uri: str,
    io_config: IOConfig | None = None,
) -> DataFrame

Import

from daft import read_hudi

# or
import daft
df = daft.read_hudi("some-table-uri")

I/O Contract

Inputs

Name Type Required Description
table_uri str Yes URI to the Hudi table (supports remote URLs such as s3:// or gs://)
io_config IOConfig | None No Custom IO configuration for accessing the Hudi table's object storage data; defaults to the IO configuration from the Daft context
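As a rough sketch of how the two inputs interact, the helper below passes an explicit IOConfig only for s3:// URIs with a known region, and otherwise leaves io_config as None so the Daft context default applies. The names storage_scheme, read_hudi_table, and s3_region are hypothetical, and the region_name keyword of S3Config is assumed from current Daft releases.

```python
from typing import Optional
from urllib.parse import urlparse


def storage_scheme(table_uri: str) -> str:
    """Return the URI scheme, treating scheme-less paths as local files."""
    return urlparse(table_uri).scheme or "file"


def read_hudi_table(table_uri: str, s3_region: Optional[str] = None):
    """Read a Hudi table, building an IOConfig only when one is needed.

    Hypothetical wrapper around daft.read_hudi; Daft is imported lazily
    so this module can be loaded without Daft installed.
    """
    import daft
    from daft.io import IOConfig, S3Config

    io_config = None  # None means: fall back to the Daft context config
    if storage_scheme(table_uri) == "s3" and s3_region is not None:
        # Assumption: S3Config accepts region_name as a keyword argument.
        io_config = IOConfig(s3=S3Config(region_name=s3_region))

    return daft.read_hudi(table_uri, io_config=io_config)
```

For example, storage_scheme("s3://bucket/path/to/hudi_table/") returns "s3", while a plain path such as "/data/hudi_table" is treated as "file". Windows drive-letter paths are not handled by this sketch.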

Outputs

Name Type Description
return DataFrame A lazy DataFrame with the schema converted from the Hudi table

Usage Examples

Basic Usage

import daft

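# Read a Hudi table (replace the placeholder with a local path or remote URI)
df = daft.read_hudi("some-table-uri")
# The filter is lazy; Daft's optimizer can push it into the Hudi scan
df = df.where(df["foo"] > 5)
# show() triggers execution and prints the first rows
df.show()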

# Read a Hudi table from S3 with custom IO config
from daft.io import S3Config, IOConfig
io_config = IOConfig(s3=S3Config(region_name="us-west-2", anonymous=True))
df = daft.read_hudi("s3://bucket/path/to/hudi_table/", io_config=io_config)
df.show()

Related Pages

Implements Principle

Requires Environment
