Workflow:Risingwavelabs Risingwave Streaming ETL Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, Data_Engineering, ETL |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for building a real-time streaming ETL pipeline using RisingWave to ingest data from message brokers, transform it with SQL, and serve results via materialized views.
Description
This workflow outlines the standard procedure for creating a streaming ETL (Extract, Transform, Load) pipeline with RisingWave. It leverages RisingWave's Postgres-compatible SQL interface to define sources (data ingestion points), materialized views (continuous transformation logic), and optionally sinks (output destinations). The pipeline continuously processes incoming events with sub-second latency and serves query results with millisecond-level response times.
The process covers:
- Goal: A running streaming pipeline that continuously transforms raw event data into queryable, pre-computed results.
- Scope: From raw Kafka/Pulsar messages to materialized views serving ad-hoc queries.
- Strategy: Uses standard SQL DDL and DML statements against RisingWave's Postgres-compatible interface, eliminating the need for custom stream processing code.
Usage
Execute this workflow when you have a streaming data source (Kafka, Pulsar, Kinesis, or other message brokers) producing continuous event data and need to perform real-time transformations such as aggregation, joining, filtering, or windowing. This is the foundational workflow for use cases including live dashboards, real-time monitoring, event-driven analytics, and feature engineering.
Execution Steps
Step 1: Deploy RisingWave Cluster
Start a RisingWave instance with the required infrastructure services. For development, use the RiseDev tool or Docker Compose. The cluster requires a meta service, compute nodes, a compactor, and object storage (MinIO for development, S3 for production).
Key considerations:
- Choose a deployment mode appropriate for your environment (standalone for development, distributed for production)
- Ensure object storage is configured and accessible
- Verify network connectivity between RisingWave and your message broker
Step 2: Create Streaming Source
Define a source connector that tells RisingWave how to ingest data from your message broker. The source definition specifies the broker address, topic, data format (JSON, Avro, Protobuf), and the schema of incoming messages.
Key considerations:
- Match the schema definition to your actual message format
- Specify the correct encoding format (JSON, Avro with schema registry, Protobuf)
- Configure consumer group settings for parallel ingestion
Step 3: Define Materialized Views
Create one or more materialized views that express your transformation logic in SQL. Materialized views are incrementally maintained, meaning RisingWave updates them in real-time as new data arrives rather than recomputing from scratch.
Key considerations:
- Use standard SQL constructs: JOIN, GROUP BY, window functions, aggregations
- Chain multiple materialized views for complex multi-stage transformations
- Consider the state size implications of unbounded joins or aggregations
Step 4: Query Materialized Views
Connect to RisingWave using any Postgres-compatible client (psql, JDBC, Python psycopg2, etc.) and issue standard SELECT queries against your materialized views. Results reflect the latest processed data with sub-second freshness.
Key considerations:
- Materialized views serve queries at 10-20ms p99 latency
- Use standard Postgres client libraries for application integration
- Monitor backpressure metrics to ensure the pipeline keeps up with input volume
Step 5: Monitor and Validate
Verify the pipeline is processing data correctly by checking record counts, monitoring the dashboard for backpressure and latency metrics, and validating output against expected results.
Key considerations:
- Use the built-in web dashboard to visualize streaming fragment graphs
- Monitor barrier latency for end-to-end freshness guarantees
- Check Grafana dashboards for throughput and resource utilization metrics