Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Risingwavelabs Risingwave Streaming ETL Pipeline

From Leeroopedia


Knowledge Sources
Domains Stream_Processing, Data_Engineering, ETL
Last Updated 2026-02-09 12:00 GMT

Overview

End-to-end process for building a real-time streaming ETL pipeline using RisingWave to ingest data from message brokers, transform it with SQL, and serve results via materialized views.

Description

This workflow outlines the standard procedure for creating a streaming ETL (Extract, Transform, Load) pipeline with RisingWave. It leverages RisingWave's Postgres-compatible SQL interface to define sources (data ingestion points), materialized views (continuous transformation logic), and optionally sinks (output destinations). The pipeline continuously processes incoming events with sub-second latency and serves query results with millisecond-level response times.

The process covers:

  • Goal: A running streaming pipeline that continuously transforms raw event data into queryable, pre-computed results.
  • Scope: From raw Kafka/Pulsar messages to materialized views serving ad-hoc queries.
  • Strategy: Uses standard SQL DDL and DML statements against RisingWave's Postgres-compatible interface, eliminating the need for custom stream processing code.

Usage

Execute this workflow when you have a streaming data source (Kafka, Pulsar, Kinesis, or other message brokers) producing continuous event data and need to perform real-time transformations such as aggregation, joining, filtering, or windowing. This is the foundational workflow for use cases including live dashboards, real-time monitoring, event-driven analytics, and feature engineering.

Execution Steps

Step 1: Deploy RisingWave Cluster

Start a RisingWave instance with the required infrastructure services. For development, use the RiseDev tool or Docker Compose. The cluster requires a meta service, compute nodes, a compactor, and object storage (MinIO for development, S3 for production).

Key considerations:

  • Choose a deployment mode appropriate for your environment (standalone for development, distributed for production)
  • Ensure object storage is configured and accessible
  • Verify network connectivity between RisingWave and your message broker

Step 2: Create Streaming Source

Define a source connector that tells RisingWave how to ingest data from your message broker. The source definition specifies the broker address, topic, data format (JSON, Avro, Protobuf), and the schema of incoming messages.

Key considerations:

  • Match the schema definition to your actual message format
  • Specify the correct encoding format (JSON, Avro with schema registry, Protobuf)
  • Configure consumer group settings for parallel ingestion

Step 3: Define Materialized Views

Create one or more materialized views that express your transformation logic in SQL. Materialized views are incrementally maintained, meaning RisingWave updates them in real-time as new data arrives rather than recomputing from scratch.

Key considerations:

  • Use standard SQL constructs: JOIN, GROUP BY, window functions, aggregations
  • Chain multiple materialized views for complex multi-stage transformations
  • Consider the state size implications of unbounded joins or aggregations

Step 4: Query Materialized Views

Connect to RisingWave using any Postgres-compatible client (psql, JDBC, Python psycopg2, etc.) and issue standard SELECT queries against your materialized views. Results reflect the latest processed data with sub-second freshness.

Key considerations:

  • Materialized views serve queries at 10-20ms p99 latency
  • Use standard Postgres client libraries for application integration
  • Monitor backpressure metrics to ensure the pipeline keeps up with input volume

Step 5: Monitor and Validate

Verify the pipeline is processing data correctly by checking record counts, monitoring the dashboard for backpressure and latency metrics, and validating output against expected results.

Key considerations:

  • Use the built-in web dashboard to visualize streaming fragment graphs
  • Monitor barrier latency for end-to-end freshness guarantees
  • Check Grafana dashboards for throughput and resource utilization metrics

Execution Diagram

GitHub URL

Workflow Repository