Workflow:Eventual Inc Daft Data Lakehouse ETL

Knowledge Sources	Daft, Daft Docs, Iceberg Connector, Delta Lake Connector, Hudi Connector
Domains	Data_Engineering, Lakehouse, ETL
Last Updated	2026-02-08 14:00 GMT

Overview

End-to-end process for extracting data from lakehouse table formats (Iceberg, Delta Lake, Hudi), transforming it with Daft's DataFrame API, and loading the results back into Iceberg or Delta Lake tables with ACID guarantees.

Description

This workflow covers the standard Extract-Transform-Load (ETL) pattern using Daft as the compute engine and open lakehouse formats as the storage layer. Daft natively reads Apache Iceberg, Delta Lake, and Apache Hudi tables and writes to Iceberg and Delta Lake (Hudi support is read-only), using partition pruning and data skipping to avoid scanning irrelevant files. The workflow supports multiple storage backends (S3, GCS, Azure, local filesystem) and integrates with catalogs through PyIceberg, delta-rs, and AWS Glue. Transformations use Daft's lazy, optimized DataFrame API, which supports joins, aggregations, window functions, and column-level operations; results are then written back with append or overwrite semantics.

Usage

Execute this workflow when you need to build or maintain data pipelines that read from and write to lakehouse table formats. Typical triggers include:

Building ETL pipelines that transform raw data into curated analytical tables
Incrementally appending new data to existing Iceberg or Delta Lake tables
Running periodic aggregations or denormalization jobs on lakehouse data
Migrating data between formats (e.g., Parquet files to Iceberg tables)
Performing cross-table joins and transformations within a lakehouse architecture

Execution Steps

Step 1: Catalog and Table Registration

Connect to the data catalog that manages your lakehouse tables. For Iceberg, load the catalog via PyIceberg (supporting REST, Glue, Hive, SQL catalogs). For Delta Lake, provide the table URI directly. For Hudi, point to the table's base path. Optionally, register catalogs with Daft's session system for SQL access.

Key considerations:

Iceberg requires a PyIceberg catalog object (load_catalog from pyiceberg)
Delta Lake tables are addressed by URI path (local, S3, GCS, Azure)
Configure IOConfig with cloud credentials (S3Config, GCSConfig, AzureConfig) if needed
Use daft.attach_catalog() to make tables available via SQL queries
Hudi tables are read-only in Daft (no write support)

Step 2: Data Extraction

Read source tables into Daft DataFrames using the appropriate reader function. Daft builds a lazy execution plan and uses the table format's metadata for partition pruning and file-level data skipping, so only files that can contain rows matching the query predicates are read.

Key considerations:

Use daft.read_iceberg(table) for Iceberg tables (pass the PyIceberg table object)
Use daft.read_deltalake(uri) for Delta Lake tables
Use daft.read_hudi(path) for Hudi tables
Apply where() filters early to enable partition pruning
Iceberg supports time travel via snapshot_id parameter
Column projection (select) pushes down to avoid reading unnecessary columns

Step 3: Data Transformation

Apply transformations using Daft's DataFrame API. All operations are lazy and compose into an optimized query plan. Daft supports column projections, filters, joins (hash, sort-merge, broadcast), aggregations (groupby, global), window functions (rank, lag, lead, running aggregates), pivot/unpivot, explode for nested data, and arbitrary expression chains.

Key considerations:

Chain with_column() for adding or transforming columns
Use groupby().agg() for aggregations with multiple aggregate functions
Use join() with appropriate strategy hints for multi-table operations
Use Window() specifications for ranking, running totals, and lag/lead operations
Apply sort() for ordered output when writing partitioned tables
Use distinct() or drop_duplicates() for deduplication

Step 4: Data Quality Validation

Verify the transformed data meets quality requirements before writing. Use summarize() for statistical profiling, describe() to check column names and types, count_rows() for row counts, and conditional filters to identify data quality issues.

Key considerations:

Use show() to visually inspect a sample of transformed data
Use summarize() for summary statistics across all columns; describe() returns the schema (column names and types) rather than statistics
Apply where() with quality predicates to filter out bad records
Use count_rows() to verify expected output volume
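Use explain(show_all=True) to review the optimized logical plan and the physical plan before execution; by default explain() prints only the unoptimized logical plan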

Step 5: Data Loading

Write the transformed DataFrame to the target lakehouse table. Daft supports append and overwrite write modes for Iceberg and Delta Lake. The write operation is atomic and returns a metadata DataFrame describing the files written.

Key considerations:

Use df.write_iceberg(table, mode="append") to add rows to an existing Iceberg table
Use df.write_iceberg(table, mode="overwrite") to replace all data in an Iceberg table
Use df.write_deltalake(uri, mode="append") or df.write_deltalake(uri, mode="overwrite") for Delta Lake tables
For file-based output, use write_parquet() or write_csv() with partition_cols for Hive-style partitioning
The write operation returns a DataFrame with metadata (operation, rows, file sizes, file paths)
Writes are transactional for Iceberg and Delta Lake (ACID guarantees)

Step 6: Post-Write Verification

After writing, verify the data was persisted correctly by reading it back and comparing row counts or sampling records. For Iceberg, verify the new snapshot was created. For Delta Lake, check the version history.

Key considerations:

Re-read the written table and compare count_rows() with expected values
Use show() on the write metadata DataFrame to inspect file-level details
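For Iceberg, confirm the new snapshot through the PyIceberg table object (for example, refresh the table and inspect its current snapshot); the write metadata lists files and row counts, not the snapshot ID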
For production pipelines, integrate with monitoring and alerting systems
