Principle: Heibaiying BigData Notes Flink External Sink
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Flink sinks write the results of stream processing to external systems such as databases, filesystems, and message queues; custom sinks extend RichSinkFunction for full lifecycle management.
Description
A sink is the terminal operator in a Flink streaming pipeline. It receives processed records from upstream transformations and writes them to an external system. Flink provides built-in sinks for common targets (Kafka, Elasticsearch, HDFS, JDBC) and a flexible API for building custom sinks.
The SinkFunction interface defines the minimal contract: a single invoke() method called for each element in the stream. For sinks that require setup and teardown logic (database connections, file handles, network clients), Flink provides RichSinkFunction, which adds lifecycle methods:
- open(Configuration parameters) -- Called once when the operator is initialized. Used to establish connections, create prepared statements, or allocate resources.
- invoke(T value, Context context) -- Called for each incoming element. Contains the core write logic (e.g., executing a SQL INSERT, sending an HTTP request, writing to a file).
- close() -- Called once when the operator is shut down. Used to release resources, close connections, and flush buffers.
The Rich variant also provides access to the RuntimeContext via getRuntimeContext(), enabling access to metrics, accumulators, and broadcast variables.
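The lifecycle contract above can be sketched in a few lines. This is a minimal Python illustration of the open/invoke/close shape, not Flink's actual API (which is Java/Scala); FakeClient is a hypothetical stand-in for a real connection.

```python
class FakeClient:
    """Hypothetical stand-in for an external system; records every write."""
    def __init__(self):
        self.writes = []
        self.closed = False

    def write(self, value):
        self.writes.append(value)

    def close(self):
        self.closed = True


class LifecycleSink:
    """Mirrors the RichSinkFunction open/invoke/close lifecycle."""
    def open(self):
        # Called once per parallel instance: acquire the resource here.
        self.client = FakeClient()

    def invoke(self, value):
        # Called once per element: perform the actual write.
        self.client.write(value)

    def close(self):
        # Called once at shutdown: release the resource.
        self.client.close()


# The runtime drives the lifecycle; the sink author only fills in the hooks.
sink = LifecycleSink()
sink.open()
for record in ["a", "b", "c"]:
    sink.invoke(record)
sink.close()
```

The key point is that the sink never manages its own lifecycle: the framework guarantees open() runs before any invoke() and close() runs after the last one, so per-instance resources can safely live as fields.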
For exactly-once sink semantics, Flink provides the TwoPhaseCommitSinkFunction base class, which implements a two-phase commit protocol coordinated with Flink's checkpointing. This is used in the Kafka producer sink and can be extended for custom transactional sinks.
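The protocol behind TwoPhaseCommitSinkFunction can be sketched as follows. The method names mirror the abstract methods of the Java class (beginTransaction, preCommit, commit, abort), but the "external system" here is just an in-memory dict standing in for a transactional target; this shows the protocol shape, not Flink's checkpoint coordination.

```python
class TwoPhaseCommitSink:
    """Sketch of the two-phase commit protocol for transactional sinks."""
    def __init__(self):
        self.staged = {}       # per-transaction buffers, durable after pre_commit
        self.published = {}    # visible to readers only after commit
        self._next_txn = 0

    def begin_transaction(self):
        # Open a fresh transaction; subsequent writes accumulate here.
        txn = self._next_txn
        self._next_txn += 1
        self.staged[txn] = []
        return txn

    def invoke(self, txn, value):
        self.staged[txn].append(value)

    def pre_commit(self, txn):
        # Phase 1 (on checkpoint): flush and make the transaction durable,
        # but not yet visible. In this sketch the staged buffer already
        # plays that role.
        pass

    def commit(self, txn):
        # Phase 2 (checkpoint completed): atomically publish the transaction.
        self.published[txn] = self.staged.pop(txn)

    def abort(self, txn):
        # Checkpoint failed: discard everything staged for this transaction.
        self.staged.pop(txn, None)


sink = TwoPhaseCommitSink()
t1 = sink.begin_transaction()
sink.invoke(t1, "x")
sink.pre_commit(t1)
sink.commit(t1)            # published atomically after the checkpoint completes

t2 = sink.begin_transaction()
sink.invoke(t2, "y")
sink.abort(t2)             # discarded when the checkpoint fails
```

Because nothing becomes visible until commit(), a failure between pre_commit() and commit() leaves the external system unchanged, which is what makes replay safe.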
Usage
Use Flink sinks when:
- Persisting results: Writing computed aggregations, alerts, or transformed records to a database for downstream querying.
- Forwarding events: Publishing processed events to another Kafka topic or message queue for downstream consumers.
- Writing to storage: Saving results to HDFS, S3, or local filesystem in formats like Parquet, Avro, or CSV.
- Custom integrations: Sending results to REST APIs, custom protocols, or proprietary systems that lack built-in Flink connectors.
Theoretical Basis
The sink operator sits at the leaf position of Flink's execution DAG. Its design follows the open-invoke-close lifecycle pattern common in resource-managing components:
Pseudocode:
class CustomSink extends RichSinkFunction<T>:
    declare connection: Connection

    open(configuration):
        // Initialize external resource (called once per operator instance)
        connection = ExternalSystem.connect(configuration)

    invoke(value, context):
        // Write each element to the external system
        connection.write(value)

    close():
        // Release external resource (called once during shutdown)
        connection.close()
Key considerations for sink design:
- Idempotency: When exactly-once semantics are required but the external system does not support transactions, design the write operation to be idempotent (e.g., using upserts with unique keys).
- Batching: For performance, buffer multiple records and write them in batches rather than one at a time. Flush the buffer in invoke() when the batch size is reached and in close() for remaining records.
- Backpressure: If the external system is slow, Flink's backpressure mechanism naturally throttles upstream operators, preventing unbounded buffering and overload rather than silently dropping records.
- Parallelism: Each parallel instance of the sink operator maintains its own connection and state. Ensure the external system can handle the concurrent write load.
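The idempotency point above can be made concrete with an upsert. In this sketch a dict keyed by a unique id stands in for a table with a unique-key constraint (the SQL equivalent would be an INSERT ... ON CONFLICT ... DO UPDATE); a replayed record overwrites itself instead of duplicating.

```python
store = {}   # stands in for a table with a unique-key constraint


def upsert(record):
    # Writing by unique key makes redelivery harmless: the second
    # delivery of the same record produces the same final state.
    store[record["id"]] = record["value"]


# A replay after failure re-delivers one record twice...
for rec in [{"id": 1, "value": "a"},
            {"id": 2, "value": "b"},
            {"id": 2, "value": "b"}]:   # duplicate delivery
    upsert(rec)

# ...yet the store holds each key exactly once: {1: 'a', 2: 'b'}
```

This is how "effectively-once" results are achieved on stores that lack transactions: at-least-once delivery plus idempotent writes.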
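The batching consideration above can be sketched as a small buffer that flushes when full and again on shutdown. This is a generic illustration, not a Flink class; flush_fn is a hypothetical stand-in for the real bulk-write call (e.g., a JDBC batch execute).

```python
class BatchingSink:
    """Buffers records and writes them in batches."""
    def __init__(self, batch_size, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn   # called with one full batch at a time
        self.buffer = []

    def invoke(self, value):
        self.buffer.append(value)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def close(self):
        # Flush the remainder; otherwise a partial final batch is lost.
        if self.buffer:
            self._flush()

    def _flush(self):
        self.flush_fn(list(self.buffer))
        self.buffer.clear()


batches = []
sink = BatchingSink(batch_size=3, flush_fn=batches.append)
for i in range(7):
    sink.invoke(i)
sink.close()
# batches is now [[0, 1, 2], [3, 4, 5], [6]]
```

Note the flush in close(): without it, any records still buffered when the job stops would never reach the external system. A production version would typically also flush on checkpoint to bound potential replay.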