Workflow:Risingwavelabs Risingwave Iceberg Lakehouse Ingestion
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, Data_Lake, Iceberg |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for continuously ingesting and transforming streaming data into Apache Iceberg tables using RisingWave, enabling real-time lakehouse architectures with open table formats.
Description
This workflow describes how to build a streaming lakehouse pipeline where RisingWave continuously processes incoming data and writes the results to Apache Iceberg tables stored on object storage. RisingWave treats Iceberg as a first-class citizen, supporting both Merge-on-Read (MoR) and Copy-on-Write (CoW) write modes, built-in table maintenance (compaction, vacuum, snapshot cleanup), and direct Iceberg REST catalog management.
The process covers:
- Goal: Continuously updated Iceberg tables containing transformed streaming data, queryable by any Iceberg-compatible engine (Spark, Presto, Trino).
- Scope: From data ingestion through transformation to Iceberg table write and maintenance.
- Strategy: Uses RisingWave's native Iceberg integration with the Java connector node for catalog operations and data file writing via the JNI bridge.
Usage
Execute this workflow when you need to build a real-time data lakehouse where streaming data is continuously written to Apache Iceberg tables for unified analytics, long-term retention, and interoperability with multiple query engines. This is suitable for scenarios requiring both real-time freshness and historical analytics on the same data.
Execution Steps
Step 1: Deploy RisingWave with Object Storage
Start RisingWave configured with object storage (S3, MinIO, GCS, or Azure Blob Storage) that will serve as the backing store for both RisingWave's internal state and the Iceberg tables. Ensure the Iceberg catalog service is accessible.
Key considerations:
- Configure S3-compatible storage credentials and bucket
- Set up an Iceberg catalog (REST, Hive Metastore, JDBC, or Glue)
- Ensure the Java connector node is running for Iceberg operations via JNI
Step 2: Create Streaming Source
Define a source connector that ingests data from your message broker or CDC source. This feeds the raw data into RisingWave for processing.
Key considerations:
- Match the source schema to your incoming data format
- Choose appropriate data format (JSON, Avro, Protobuf)
- For CDC-to-Iceberg pipelines, create CDC source tables directly
Step 3: Define Transformation Logic
Create materialized views that express the transformation, aggregation, or enrichment logic you want applied before writing to Iceberg. This can include joins, window functions, deduplication, and other SQL operations.
Key considerations:
- Materialized views are incrementally maintained for efficiency
- Chain multiple views for complex multi-stage transformations
- Consider partitioning strategy for the eventual Iceberg table
Step 4: Create Iceberg Sink
Define an Iceberg sink that connects a materialized view or table to an Iceberg table. Specify the catalog type, warehouse location, database, table name, and write mode (append-only or upsert).
Key considerations:
- Choose between Merge-on-Read (MoR) for write-heavy workloads and Copy-on-Write (CoW) for read-heavy workloads
- Specify the Iceberg catalog configuration (REST catalog URI, warehouse path)
- Define primary keys for upsert mode sinks
Step 5: Query via External Engines
Query the Iceberg tables using external analytics engines such as Spark, Presto, or Trino. The data written by RisingWave follows the open Iceberg format and is fully compatible with the broader Iceberg ecosystem.
Key considerations:
- Configure external engines to use the same Iceberg catalog
- Data freshness depends on RisingWave's checkpoint interval and sink commit frequency
- Use Presto or Trino for interactive analytics on the Iceberg tables
Step 6: Maintain Iceberg Tables
Perform ongoing table maintenance including compaction of small files, snapshot expiration, and orphan file cleanup. RisingWave provides built-in maintenance capabilities, and external tools like Airflow can schedule recurring maintenance tasks.
Key considerations:
- Small file compaction improves query performance on Iceberg tables
- Snapshot expiration prevents unbounded metadata growth
- Orphan file cleanup reclaims storage from failed or aborted writes