Implementation:Eventual Inc Daft Read Iceberg
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Lakehouse |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by the Daft library, for reading Apache Iceberg tables into a lazy distributed DataFrame.
Description
The read_iceberg function creates a lazy DataFrame scan of an Apache Iceberg table. It accepts either a string path to an Iceberg metadata file or a PyIceberg Table object. When given a string, it uses StaticTable.from_metadata() to load the table. IO configuration is resolved from the table's file IO properties if not explicitly provided. The function creates an IcebergScanOperator that handles partition pruning and predicate pushdown through the Iceberg metadata layer. Multithreaded IO is automatically disabled when running on the Ray runner to limit resource contention.
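The resolution steps described above can be sketched in plain Python. This is a simplified model and not Daft's source: StubTable, resolve_table, resolve_io_config, use_multithreaded_io, and plan_scan are hypothetical names, the IO configuration is modeled as a plain dict, and the runner names "ray" and "native" are illustrative strings.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Union


@dataclass
class StubTable:
    """Hypothetical stand-in for a PyIceberg Table (not the real class)."""

    metadata_location: str
    io_properties: Dict[str, str] = field(default_factory=dict)


def resolve_table(
    table: Union[str, StubTable],
    load_from_metadata: Callable[[str], StubTable],
) -> StubTable:
    # A string is treated as a path to a metadata file and loaded;
    # an existing table object is passed through unchanged.
    if isinstance(table, str):
        return load_from_metadata(table)
    return table


def resolve_io_config(
    table: StubTable, io_config: Optional[Dict[str, str]]
) -> Dict[str, str]:
    # An explicit io_config wins; otherwise fall back to the
    # table's own file IO properties.
    if io_config is not None:
        return io_config
    return dict(table.io_properties)


def use_multithreaded_io(runner_name: str) -> bool:
    # Assumption: the Ray runner is identified by the name "ray".
    return runner_name != "ray"


def plan_scan(
    table: Union[str, StubTable],
    snapshot_id: Optional[int] = None,
    io_config: Optional[Dict[str, str]] = None,
    runner_name: str = "native",
) -> Dict[str, Any]:
    # Collect the values a scan operator would be constructed with.
    resolved = resolve_table(table, lambda path: StubTable(metadata_location=path))
    return {
        "metadata_location": resolved.metadata_location,
        "snapshot_id": snapshot_id,
        "io_config": resolve_io_config(resolved, io_config),
        "multithreaded_io": use_multithreaded_io(runner_name),
    }
```

In this model, calling plan_scan with a string path and runner_name="ray" yields a plan with multithreaded IO turned off, while passing a table object with no explicit io_config reuses that table's IO properties.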
Usage
Import and use this function when you need to read data from an Apache Iceberg table with snapshot isolation and optional time travel via snapshot IDs.
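For time travel, a snapshot ID has to come from somewhere. The sketch below shows one way to pick the ID of the most recent snapshot committed at or before a given time. The helper snapshot_id_as_of is a hypothetical name and not part of Daft; it assumes the table object exposes snapshots() returning objects with snapshot_id and timestamp_ms attributes, as PyIceberg tables do to the best of our knowledge. The stub table exists only to keep the example self-contained.

```python
from types import SimpleNamespace
from typing import Any, Optional


def snapshot_id_as_of(table: Any, timestamp_ms: int) -> Optional[int]:
    """Return the ID of the newest snapshot at or before timestamp_ms.

    Hypothetical helper: relies only on table.snapshots() yielding
    objects with snapshot_id and timestamp_ms attributes.
    """
    candidates = [s for s in table.snapshots() if s.timestamp_ms <= timestamp_ms]
    if not candidates:
        return None  # no snapshot existed yet at that time
    return max(candidates, key=lambda s: s.timestamp_ms).snapshot_id


# Stub standing in for a real PyIceberg table, for illustration only.
stub_table = SimpleNamespace(
    snapshots=lambda: [
        SimpleNamespace(snapshot_id=101, timestamp_ms=1_000),
        SimpleNamespace(snapshot_id=102, timestamp_ms=2_000),
        SimpleNamespace(snapshot_id=103, timestamp_ms=3_000),
    ]
)

chosen = snapshot_id_as_of(stub_table, timestamp_ms=2_500)
```

The chosen ID would then be passed as snapshot_id to read_iceberg. Because snapshot_id=None means "latest snapshot", a caller may prefer to treat a None result from this helper as an error instead of forwarding it.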
Code Reference
Source Location
- Repository: Daft
- File: daft/io/iceberg/_iceberg.py
- Lines: L56-114
Signature
def read_iceberg(
table: Union[str, "PyIcebergTable"],
snapshot_id: int | None = None,
io_config: IOConfig | None = None,
) -> DataFrame
Import
from daft import read_iceberg
# or
import daft
df = daft.read_iceberg(table)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table | Union[str, PyIcebergTable] | Yes | Path to an Iceberg metadata file (supports s3://, gs://) or a PyIceberg Table instance |
| snapshot_id | int or None | No | Specific snapshot ID to query for time travel; defaults to latest snapshot |
| io_config | IOConfig or None | No | Custom IO configuration for accessing object storage; defaults to table's file IO properties |
Outputs
| Name | Type | Description |
|---|---|---|
| return | DataFrame | A lazy DataFrame with the schema converted from the Iceberg table, supporting predicate pushdown and partition pruning |
Usage Examples
Basic Usage
import daft
# pyiceberg_table is assumed to be a PyIceberg Table loaded beforehand (e.g. from a catalog)
# Read from a PyIceberg table object
df = daft.read_iceberg(pyiceberg_table)
# Apply filters (pushed down to Iceberg metadata layer)
df = df.where(df["category"] == "electronics")
df.show()
# Read with time travel to a specific snapshot
df = daft.read_iceberg(pyiceberg_table, snapshot_id=123456789)
# Read from a metadata file path with custom IO config
from daft.io import S3Config, IOConfig
io_config = IOConfig(s3=S3Config(region="us-west-2", anonymous=True))
df = daft.read_iceberg("s3://bucket/path/to/metadata.json", io_config=io_config)
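The examples above assume an existing pyiceberg_table object. One way to obtain it is through a PyIceberg catalog, sketched below. The function read_table_from_catalog is a hypothetical wrapper, the catalog name "default" and any table identifier passed in are placeholders, and the catalog is assumed to be configured already (for example in a PyIceberg configuration file). Imports are deferred into the function body so that defining it does not require pyiceberg or daft to be installed.

```python
from typing import Any, Optional


def read_table_from_catalog(
    identifier: str,
    catalog_name: str = "default",
    snapshot_id: Optional[int] = None,
) -> Any:
    """Load an Iceberg table from a PyIceberg catalog and scan it with Daft.

    Hypothetical wrapper for illustration; catalog_name and identifier
    are placeholders that must match your own catalog setup.
    """
    # Deferred imports: only needed when the function is actually called.
    import daft
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name)  # reads the catalog's configured properties
    pyiceberg_table = catalog.load_table(identifier)  # e.g. "namespace.table_name"
    return daft.read_iceberg(pyiceberg_table, snapshot_id=snapshot_id)
```

A call such as read_table_from_catalog("namespace.table_name") would return a lazy DataFrame; nothing is read until an action such as show() or collect() runs.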