Workflow:Risingwavelabs Risingwave Iceberg Lakehouse Ingestion

Knowledge Sources	RisingWave RisingWave Docs Iceberg Overview
Domains	Stream_Processing, Data_Lake, Iceberg
Last Updated	2026-02-09 12:00 GMT

Overview

End-to-end process for continuously ingesting and transforming streaming data into Apache Iceberg tables using RisingWave, enabling real-time lakehouse architectures with open table formats.

Description

This workflow describes how to build a streaming lakehouse pipeline where RisingWave continuously processes incoming data and writes the results to Apache Iceberg tables stored on object storage. RisingWave treats Iceberg as a first-class citizen, supporting both Merge-on-Read (MoR) and Copy-on-Write (CoW) write modes, built-in table maintenance (compaction, vacuum, snapshot cleanup), and direct Iceberg REST catalog management.

The process covers:

Goal: Continuously updated Iceberg tables containing transformed streaming data, queryable by any Iceberg-compatible engine (Spark, Presto, Trino).
Scope: From data ingestion through transformation to Iceberg table write and maintenance.
Strategy: Uses RisingWave's native Iceberg integration with the Java connector node for catalog operations and data file writing via the JNI bridge.

Usage

Execute this workflow when you need to build a real-time data lakehouse where streaming data is continuously written to Apache Iceberg tables for unified analytics, long-term retention, and interoperability with multiple query engines. This is suitable for scenarios requiring both real-time freshness and historical analytics on the same data.

Execution Steps

Step 1: Deploy RisingWave with Object Storage

Start RisingWave configured with object storage (S3, MinIO, GCS, or Azure Blob Storage) that will serve as the backing store for both RisingWave's internal state and the Iceberg tables. Ensure the Iceberg catalog service is accessible.

Key considerations:

Configure S3-compatible storage credentials and bucket
Set up an Iceberg catalog (REST, Hive Metastore, JDBC, or Glue)
Ensure the Java connector node is running for Iceberg operations via JNI

Step 2: Create Streaming Source

Define a source connector that ingests data from your message broker or CDC source. This feeds the raw data into RisingWave for processing.

Key considerations:

Match the source schema to your incoming data format
Choose appropriate data format (JSON, Avro, Protobuf)
For CDC-to-Iceberg pipelines, create CDC source tables directly

Step 3: Define Transformation Logic

Create materialized views that express the transformation, aggregation, or enrichment logic you want applied before writing to Iceberg. This can include joins, window functions, deduplication, and other SQL operations.

Key considerations:

Materialized views are incrementally maintained for efficiency
Chain multiple views for complex multi-stage transformations
Consider partitioning strategy for the eventual Iceberg table

Step 4: Create Iceberg Sink

Define an Iceberg sink that connects a materialized view or table to an Iceberg table. Specify the catalog type, warehouse location, database, table name, and write mode (append-only or upsert).

Key considerations:

Choose between Merge-on-Read (MoR) for write-heavy workloads and Copy-on-Write (CoW) for read-heavy workloads
Specify the Iceberg catalog configuration (REST catalog URI, warehouse path)
Define primary keys for upsert mode sinks

Step 5: Query via External Engines

Query the Iceberg tables using external analytics engines such as Spark, Presto, or Trino. The data written by RisingWave follows the open Iceberg format and is fully compatible with the broader Iceberg ecosystem.

Key considerations:

Configure external engines to use the same Iceberg catalog
Data freshness depends on RisingWave's checkpoint interval and sink commit frequency
Use Presto or Trino for interactive analytics on the Iceberg tables

Step 6: Maintain Iceberg Tables

Perform ongoing table maintenance including compaction of small files, snapshot expiration, and orphan file cleanup. RisingWave provides built-in maintenance capabilities, and external tools like Airflow can schedule recurring maintenance tasks.

Key considerations:

Small file compaction improves query performance on Iceberg tables
Snapshot expiration prevents unbounded metadata growth
Orphan file cleanup reclaims storage from failed or aborted writes

Execution Diagram

GitHub URL

Workflow Repository