Implementation:Eventual Inc Daft Read Hudi
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lakehouse |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by the Daft library, for reading Apache Hudi tables into a lazy, distributed DataFrame.
Description
The read_hudi function creates a lazy DataFrame scan of an Apache Hudi table. It accepts a table URI (including remote object stores like s3:// and gs://) and an optional IO configuration. The function creates a HudiScanOperator that handles reading the Hudi table's metadata and data files. Multithreaded IO is automatically disabled when running on the Ray runner to reduce resource contention. The resulting DataFrame is lazy and supports predicate pushdown and partition pruning through Daft's query optimizer.
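The configuration step described above can be sketched in plain Python. This is a minimal illustration, not Daft's actual code: `StorageSettings` and `build_storage_settings` are hypothetical stand-ins, and identifying the Ray runner by the name `"ray"` is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class StorageSettings:
    """Hypothetical stand-in for the storage settings handed to the scan operator."""

    multithreaded_io: bool
    io_config: Optional[Any]


def build_storage_settings(
    runner_name: str,
    io_config: Optional[Any] = None,
    default_io_config: Optional[Any] = None,
) -> StorageSettings:
    """Resolve the IO config and decide whether multithreaded IO is enabled."""
    # Fall back to the context-level default when the caller passes no config.
    resolved = default_io_config if io_config is None else io_config
    # Assumption: the Ray runner reports its name as "ray"; multithreaded IO
    # is switched off there to reduce resource contention.
    return StorageSettings(
        multithreaded_io=(runner_name != "ray"),
        io_config=resolved,
    )
```

The sketch only shows the order of decisions: resolve the IO configuration first, then pick the IO mode from the active runner.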
Usage
Import and use this function when you need to read data from an Apache Hudi table into a Daft DataFrame for analysis or further processing.
Code Reference
Source Location
- Repository: Daft
- File: daft/io/hudi/_hudi.py
- Lines: L13-52
Signature
def read_hudi(
table_uri: str,
io_config: IOConfig | None = None,
) -> DataFrame
Import
from daft import read_hudi
# or
import daft
df = daft.read_hudi(uri)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table_uri | str | Yes | URI to the Hudi table (supports remote URLs such as s3:// or gs://) |
| io_config | IOConfig or None | No | Custom IO configuration for accessing the Hudi table's data in object storage; defaults to the IO configuration from the Daft context |
Outputs
| Name | Type | Description |
|---|---|---|
| return | DataFrame | A lazy DataFrame with the schema converted from the Hudi table |
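As a hedged illustration of the input contract, the sketch below chooses an `io_config` based on the scheme of `table_uri` before calling `daft.read_hudi`. The helpers `uri_scheme` and `read_hudi_table` are hypothetical names introduced for this example, and Daft is imported inside the function so the scheme helper can be used without Daft installed.

```python
from typing import Optional
from urllib.parse import urlparse


def uri_scheme(table_uri: str) -> str:
    """Return the URI scheme, treating plain paths as 'file'."""
    scheme = urlparse(table_uri).scheme
    # A one-letter "scheme" is most likely a Windows drive letter such as C:
    return scheme if len(scheme) > 1 else "file"


def read_hudi_table(table_uri: str, region: Optional[str] = None):
    """Hypothetical wrapper: build an S3 IOConfig only when it is needed."""
    import daft
    from daft.io import IOConfig, S3Config

    io_config = None
    if uri_scheme(table_uri) == "s3" and region is not None:
        io_config = IOConfig(s3=S3Config(region_name=region))
    # With io_config=None, read_hudi falls back to the Daft context config.
    return daft.read_hudi(table_uri, io_config=io_config)
```

Passing `io_config=None` keeps the documented default behavior, so the wrapper only overrides configuration when a region is supplied explicitly.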
Usage Examples
Basic Usage
import daft
# Read a Hudi table (replace the placeholder with a real table URI or local path)
df = daft.read_hudi("some-table-uri")
df = df.where(df["foo"] > 5)
df.show()
# Read a Hudi table from S3 with custom IO config
from daft.io import S3Config, IOConfig
io_config = IOConfig(s3=S3Config(region_name="us-west-2", anonymous=True))
df = daft.read_hudi("s3://bucket/path/to/hudi_table/", io_config=io_config)
df.show()