Heuristic: Watermark Late Arrival Tolerance (DataExpert.io Data Engineer Handbook)
| Knowledge Sources | |
|---|---|
| Domains | Stream_Processing, Optimization |
| Last Updated | 2026-02-09 06:00 GMT |
Overview
A watermark strategy using a 15-second late-arrival tolerance for Flink windowed aggregations over Kafka event streams.
Description
In Flink event-time processing, watermarks signal to the system that no events with timestamps older than the watermark value should be expected. The watermark delay (also called bounded out-of-orderness) defines how long the system waits for late-arriving events before closing a window and emitting results. The repository uses a 15-second watermark delay, meaning events arriving up to 15 seconds after their event timestamp can still be included in the correct window.
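The mechanics can be illustrated with a small plain-Python sketch (not Flink code): a bounded-out-of-orderness watermark tracks the maximum timestamp seen so far, minus the delay, and any event whose timestamp falls at or behind that watermark is considered late. The event values below are fabricated for illustration.

```python
DELAY = 15  # watermark delay (bounded out-of-orderness), in seconds


def process(events):
    """Track the watermark over event timestamps (epoch seconds) and
    flag each event as on-time (True) or late (False)."""
    max_ts = float("-inf")
    results = []
    for ts in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - DELAY  # max observed timestamp minus the delay
        # Flink treats an element with timestamp <= watermark as late
        results.append((ts, ts > watermark))
    return results


# An event 10 s out of order is still on time; 20 s out of order is late.
print(process([100, 110, 100, 130, 110]))
# -> [(100, True), (110, True), (100, True), (130, True), (110, False)]
```

Note that only the last event is flagged: it arrives 20 seconds behind the stream's maximum timestamp, beyond the 15-second tolerance.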
Usage
Apply this heuristic when defining watermark strategies for Flink SQL or DataStream windowed aggregations over event-time data. The 15-second tolerance is appropriate for web traffic analytics where network delays and client-side batching can cause modest event lateness. Adjust the tolerance based on the expected lateness profile of your data source.
The Insight (Rule of Thumb)
- Action: Define watermark in Flink SQL DDL using `WATERMARK FOR <timestamp_col> AS <timestamp_col> - INTERVAL '<N>' SECOND`.
- Value: 15 seconds for web traffic event streams with moderate lateness.
- Trade-off: Larger watermark delays capture more late events (higher completeness) but increase end-to-end latency of window results. Smaller delays reduce latency but drop more late events.
- Rule: Set the watermark delay to the 95th-99th percentile of observed event lateness. If most events arrive within 10 seconds but some take up to 15 seconds, a 15-second watermark provides good coverage.
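The percentile rule above can be sketched in a few lines of Python. The lateness sample below is fabricated; in practice you would compute `arrival_time - event_time` from your own stream.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a list of lateness values (seconds)."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]


# lateness = arrival_time - event_time for each observed event (fabricated)
observed_lateness = [0, 0, 1, 2, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15]
delay = percentile(observed_lateness, 95)
print(f"WATERMARK FOR window_timestamp AS window_timestamp - INTERVAL '{delay}' SECOND")
```

For this sample the p95 lateness is 15 seconds, matching the repository's chosen delay.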
Reasoning
Web traffic events from Confluent Cloud Kafka can experience lateness due to:
- Client-side batching and network delays
- Kafka producer buffering (`linger.ms`)
- Cross-region replication lag
A 15-second watermark is a pragmatic balance for a bootcamp/demo scenario:
- Completeness: Captures the vast majority of late events
- Latency: Window results are delayed by at most 15 seconds beyond the window end
- Memory: Flink must buffer state for open windows during the watermark delay
For production systems, measure actual event lateness distributions and set the watermark accordingly. Some production systems use watermarks of 1-5 minutes for high-lateness data sources.
Code Evidence
Watermark definition from `aggregation_job.py:87-95`:
```sql
CREATE TABLE events_source_kafka (
    ...
    window_timestamp AS CAST(event_timestamp AS TIMESTAMP(3)),
    WATERMARK FOR window_timestamp AS window_timestamp - INTERVAL '15' SECOND
) WITH (
    'connector' = 'kafka',
    'scan.startup.mode' = 'latest-offset',
    ...
)
```
Tumbling window that depends on the watermark from `aggregation_job.py:96-108`:
```sql
SELECT
    host,
    window_start,
    window_end,
    COUNT(*) AS num_hits
FROM TABLE(
    TUMBLE(TABLE events_source_kafka, DESCRIPTOR(window_timestamp), INTERVAL '1' MINUTE)
)
GROUP BY host, window_start, window_end
```
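How the 1-minute tumbling window and the 15-second watermark interact can be sketched in plain Python (a simplified model, not Flink's actual runtime): events are assigned to fixed 60-second windows, a window `[start, end)` is emitted once the watermark passes its end, and events for an already-emitted window are dropped. The event stream below is fabricated.

```python
WINDOW = 60  # tumbling window size, seconds (INTERVAL '1' MINUTE)
DELAY = 15   # watermark delay, seconds


def run(events):
    """events: (timestamp, host) pairs in arrival order.
    Returns emitted windows as (host, window_start, window_end, num_hits)."""
    open_windows = {}  # (host, window_start) -> hit count
    emitted = []
    max_ts = float("-inf")
    for ts, host in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - DELAY
        start = ts - ts % WINDOW  # TUMBLE window assignment
        if start + WINDOW > watermark:  # window still open: count the event
            open_windows[(host, start)] = open_windows.get((host, start), 0) + 1
        # close every window whose end the watermark has now passed
        for key in sorted(k for k in open_windows if k[1] + WINDOW <= watermark):
            emitted.append((key[0], key[1], key[1] + WINDOW, open_windows.pop(key)))
    return emitted


print(run([(0, "a"), (30, "a"), (70, "a"), (76, "a")]))
# -> [('a', 0, 60, 2)]
```

The window `[0, 60)` is only emitted after the event at t=76 pushes the watermark to 61, past the window's end: the 15-second delay is exactly the extra latency paid for late-event coverage.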