Principle: Streaming ETL Pipeline (DataExpert.io Data Engineer Handbook)
Overview
Streaming ETL Pipeline describes the theory of building Extract-Transform-Load (ETL) pipelines in a streaming context using Apache Flink. Unlike batch ETL, streaming ETL processes data continuously as it arrives, providing low-latency results while maintaining fault tolerance through checkpointing.
StreamExecutionEnvironment and StreamTableEnvironment
A Flink streaming ETL pipeline requires two foundational objects:
- StreamExecutionEnvironment -- the core runtime context for Flink streaming programs. It manages the execution graph, parallelism, and checkpointing configuration.
- StreamTableEnvironment -- a higher-level abstraction built on top of the execution environment that enables SQL and Table API operations on streaming data.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

# Core runtime context for the streaming job
env = StreamExecutionEnvironment.get_execution_environment()
# Table/SQL layer built on top of the DataStream environment
t_env = StreamTableEnvironment.create(stream_execution_environment=env)
The StreamTableEnvironment bridges the DataStream API and the Table API, allowing developers to define sources, sinks, and transformations using SQL DDL and DML statements.
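As a sketch of what such a SQL DDL definition looks like, a Kafka-backed source table could be declared as follows. The topic name, broker address, and schema here are illustrative assumptions, not taken from the handbook:

```sql
-- Hypothetical Kafka source table; connector options are illustrative.
CREATE TABLE events (
    ip STRING,
    event_timestamp TIMESTAMP(3),
    referrer STRING,
    host STRING,
    url STRING,
    WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '15' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'events',
    'properties.bootstrap.servers' = 'localhost:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);
```

In PyFlink this DDL would be submitted with t_env.execute_sql(...), after which the table can be referenced by name in queries.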
Checkpointing for Fault Tolerance
Checkpointing is Flink's mechanism for fault tolerance. It periodically snapshots the state of all operators so the job can recover from failures without data loss:
- Checkpoints are triggered at a configurable interval (e.g., every 10,000 milliseconds).
- Each checkpoint captures a consistent snapshot of the distributed state.
- On failure, Flink restores from the latest successful checkpoint and replays records from the source.
env.enable_checkpointing(10000) # checkpoint every 10 seconds
For streaming ETL pipelines that write to external sinks (e.g., PostgreSQL), checkpointing ensures exactly-once or at-least-once delivery semantics depending on the sink connector's capabilities.
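A minimal configuration sketch, assuming PyFlink is installed: the checkpoint interval and mode can be set together on the execution environment (the interval value is the same 10-second example used above).

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 10 seconds with exactly-once state consistency;
# end-to-end semantics still depend on the sink connector.
env.enable_checkpointing(10000, CheckpointingMode.EXACTLY_ONCE)
```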
INSERT INTO ... SELECT Pattern
The core of a streaming ETL pipeline in Flink SQL is the INSERT INTO ... SELECT pattern. This statement:
- Extracts data from a source table (e.g., Kafka).
- Transforms data using SQL expressions, UDFs, or joins.
- Loads the results into a sink table (e.g., PostgreSQL via JDBC).
INSERT INTO processed_events
SELECT
    ip,
    event_timestamp,
    referrer,
    host,
    url,
    get_location(ip) AS geodata
FROM events;
This pattern creates a continuous query -- the statement runs indefinitely, processing each new record as it arrives from the source.
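The get_location call in the query above is a user-defined function (UDF) used for enrichment. A minimal sketch of such a function is below; the in-memory lookup table is a hypothetical stand-in for a real geolocation service, and the PyFlink registration step is shown only in comments:

```python
import json

# Hypothetical in-memory lookup; a real pipeline would query a
# geolocation database or API instead.
_GEO_TABLE = {
    "203.0.113.0": {"country": "US", "city": "New York"},
}

def get_location(ip: str) -> str:
    """Return geolocation data for an IP as a JSON string ('{}' if unknown)."""
    return json.dumps(_GEO_TABLE.get(ip, {}))

# In the Flink job, the function would be registered as a scalar UDF so the
# SQL query can call it by name, e.g. (sketch, assuming PyFlink is installed):
#   from pyflink.table.udf import udf
#   from pyflink.table import DataTypes
#   t_env.create_temporary_function(
#       "get_location", udf(get_location, result_type=DataTypes.STRING()))
```

Registering the UDF with the table environment makes it callable from any SQL statement executed through that environment.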
When to Use
Use a streaming ETL pipeline when:
- Building real-time data pipelines that consume from Kafka and write to PostgreSQL or other sinks.
- You need continuous, low-latency processing of incoming events.
- Fault tolerance and exactly-once semantics are required.
- The transformation logic can be expressed as SQL with optional UDF enrichment.