Principle: DataTalksClub Data Engineering Zoomcamp — Stream Output Sink
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub/data-engineering-zoomcamp (07-streaming) |
| Domains | Data_Engineering, Stream_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
An output sink directs processed stream results to their destination, with configurable output modes and trigger intervals controlling how and when results are emitted.
Description
An output sink is the terminal stage of a streaming pipeline. After data has been read from a source, parsed, and transformed, the results must be written somewhere. The choice of sink and its configuration determine the pipeline's behavior and utility.
The key concepts in the output sink pattern are:
- Sink destination: The target for streaming results. Common destinations include:
  - Console sink: Prints results to standard output. Used for debugging and development. Not suitable for production workloads.
  - Kafka sink: Writes results back to a Kafka topic for downstream consumption by other services or pipelines. This is the standard production pattern for chaining streaming stages.
  - Memory sink: Stores results in an in-memory table that can be queried with SQL. Useful for interactive exploration and testing.
  - File sink: Writes results to files (Parquet, JSON, CSV) on a distributed file system. Used for archival and batch-streaming hybrid architectures.
- Output mode: Controls which rows are written to the sink on each trigger:
  - Append: Only new rows added since the last trigger are written. Suitable for non-aggregation queries.
  - Complete: The entire result table is written on each trigger. Required for aggregation queries where previously emitted results may change.
  - Update: Only rows that have changed since the last trigger are written. A middle ground between append and complete.
- Trigger interval: Controls how frequently the streaming engine processes a micro-batch. A trigger of 5 seconds means the engine will accumulate messages for 5 seconds, process them, and emit results. Shorter intervals reduce latency; longer intervals improve throughput.
- Data format preparation: When writing to a Kafka sink, the output DataFrame must conform to Kafka's expected schema: a key column (string) and a value column (string). This requires concatenating result columns into a single value string and optionally renaming a column as the key.
- Checkpoint location: The sink records its progress to a checkpoint directory. This enables exactly-once delivery to the output by tracking which micro-batches have been successfully written.
Usage
Use this principle when:
- You need to deliver streaming results to downstream systems, user interfaces, or storage layers.
- You are debugging a streaming pipeline and need to inspect intermediate or final results.
- You want to chain multiple streaming stages by writing results to a Kafka topic that feeds into another pipeline.
- You need to control the latency-throughput tradeoff via trigger interval tuning.
Theoretical Basis
The output sink pattern follows a prepare-configure-write model:
-- Console sink for debugging
FUNCTION sink_to_console(dataframe, output_mode, trigger_interval):
    query = dataframe
        .writeStream
        .outputMode(output_mode)                   -- "complete", "append", or "update"
        .trigger(processingTime=trigger_interval)  -- e.g., "5 seconds"
        .format("console")
        .option("truncate", FALSE)
        .start()
    RETURN query
-- Kafka sink for production
FUNCTION sink_to_broker(dataframe, topic):
    -- Step 1: Prepare the DataFrame for Kafka's expected key/value schema
    prepared = dataframe
        .withColumn("value", concatenate(result_columns, separator=", "))
        .rename(key_column, "key")
        .cast("key", STRING)
        .select("key", "value")
    -- Step 2: Write to the broker
    query = prepared
        .writeStream
        .format("kafka")
        .option("bootstrap.servers", "broker:9092")
        .outputMode("complete")
        .option("topic", topic)
        .option("checkpointLocation", "checkpoint/")
        .start()
    RETURN query
The design principles of this pattern are:
- Sink independence: The same streaming DataFrame can be written to multiple sinks simultaneously. A console sink for debugging and a Kafka sink for production can run in parallel on the same data.
- Output mode alignment: The output mode must match the query semantics. Aggregation queries require complete or update mode because previously emitted counts may change as new data arrives. Non-aggregation queries can use append mode.
- Schema conformance for broker sinks: Kafka expects a simple key-value schema. The prepare step transforms a multi-column aggregation result into the required two-column format, concatenating value columns and casting the key to a string.
- Checkpoint-based exactly-once delivery: The checkpoint directory ensures that each micro-batch is written to the sink exactly once, even across restarts. Without a checkpoint, the pipeline would either skip data or produce duplicates after recovery.