Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Eventual Inc Daft Data Lakehouse ETL

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Lakehouse, ETL
Last Updated 2026-02-08 14:00 GMT

Overview

End-to-end process for extracting data from lakehouse table formats (Iceberg, Delta Lake, Hudi), transforming it with Daft's DataFrame API, and loading results back into lakehouse tables with ACID guarantees.

Description

This workflow covers the standard Extract-Transform-Load (ETL) pattern using Daft as the compute engine and open lakehouse formats as the storage layer. Daft natively integrates with Apache Iceberg, Delta Lake, and Apache Hudi for reading and writing, leveraging partition pruning and data skipping optimizations for efficient queries. The workflow supports multi-cloud storage backends (S3, GCS, Azure, local filesystem) and provides catalog integration through PyIceberg, delta-rs, and AWS Glue. Transformations use Daft's lazy DataFrame API with query optimization, supporting joins, aggregations, window functions, and column-level operations before writing results back with append or overwrite semantics.

Usage

Execute this workflow when you need to build or maintain data pipelines that read from and write to lakehouse table formats. Typical triggers include:

  • Building ETL pipelines that transform raw data into curated analytical tables
  • Incrementally appending new data to existing Iceberg or Delta Lake tables
  • Running periodic aggregations or denormalization jobs on lakehouse data
  • Migrating data between formats (e.g., Parquet files to Iceberg tables)
  • Performing cross-table joins and transformations within a lakehouse architecture

Execution Steps

Step 1: Catalog and Table Registration

Connect to the data catalog that manages your lakehouse tables. For Iceberg, load the catalog via PyIceberg (supporting REST, Glue, Hive, SQL catalogs). For Delta Lake, provide the table URI directly. For Hudi, point to the table's base path. Optionally, register catalogs with Daft's session system for SQL access.

Key considerations:

  • Iceberg requires a PyIceberg catalog object (load_catalog from pyiceberg)
  • Delta Lake tables are addressed by URI path (local, S3, GCS, Azure)
  • Configure IOConfig with cloud credentials (S3Config, GCSConfig, AzureConfig) if needed
  • Use daft.attach_catalog() to make tables available via SQL queries
  • Hudi tables are read-only in Daft (no write support)

Step 2: Data Extraction

Read source tables into Daft DataFrames using the appropriate reader function. Daft creates a lazy execution plan that leverages the table format's metadata for partition pruning and file-level data skipping. Only the files matching query predicates are actually read.

Key considerations:

  • Use daft.read_iceberg(table) for Iceberg tables (pass the PyIceberg table object)
  • Use daft.read_deltalake(uri) for Delta Lake tables
  • Use daft.read_hudi(path) for Hudi tables
  • Apply where() filters early to enable partition pruning
  • Iceberg supports time travel via snapshot_id parameter
  • Column projection (select) pushes down to avoid reading unnecessary columns

Step 3: Data Transformation

Apply transformations using Daft's DataFrame API. All operations are lazy and compose into an optimized query plan. Daft supports column projections, filters, joins (hash, sort-merge, broadcast), aggregations (groupby, global), window functions (rank, lag, lead, running aggregates), pivot/unpivot, explode for nested data, and arbitrary expression chains.

Key considerations:

  • Chain with_column() for adding or transforming columns
  • Use groupby().agg() for aggregations with multiple aggregate functions
  • Use join() with appropriate strategy hints for multi-table operations
  • Use Window() specifications for ranking, running totals, and lag/lead operations
  • Apply sort() for ordered output when writing partitioned tables
  • Use distinct() or drop_duplicates() for deduplication

Step 4: Data Quality Validation

Verify the transformed data meets quality requirements before writing. Use describe() or summarize() for statistical profiling, count_rows() for row counts, and conditional filters to identify data quality issues.

Key considerations:

  • Use show() to visually inspect a sample of transformed data
  • Use describe() for summary statistics across all columns
  • Apply where() with quality predicates to filter out bad records
  • Use count_rows() to verify expected output volume
  • Use explain() to review the optimized query plan before execution

Step 5: Data Loading

Write the transformed DataFrame to the target lakehouse table. Daft supports append and overwrite write modes for Iceberg and Delta Lake. The write operation is atomic and returns a metadata DataFrame describing the files written.

Key considerations:

  • Use df.write_iceberg(table, mode="append") to add rows to an existing Iceberg table
  • Use df.write_iceberg(table, mode="overwrite") to replace all data in an Iceberg table
  • Use df.write_deltalake(uri, mode="append"|"overwrite") for Delta Lake tables
  • For file-based output, use write_parquet() or write_csv() with partition_cols for Hive-style partitioning
  • The write operation returns a DataFrame with metadata (operation, rows, file sizes, file paths)
  • Writes are transactional for Iceberg and Delta Lake (ACID guarantees)

Step 6: Post-Write Verification

After writing, verify the data was persisted correctly by reading it back and comparing row counts or sampling records. For Iceberg, verify the new snapshot was created. For Delta Lake, check the version history.

Key considerations:

  • Re-read the written table and compare count_rows() with expected values
  • Use show() on the write metadata DataFrame to inspect file-level details
  • For Iceberg, the new snapshot ID is available in the write metadata
  • For production pipelines, integrate with monitoring and alerting systems

Execution Diagram

GitHub URL

Workflow Repository