Principle:DataExpert io Data engineer handbook Windowed Aggregation

From Leeroopedia


Overview

Windowed Aggregation describes the theory and practice of computing aggregations over bounded time intervals in stream processing. Windows partition an infinite stream into finite chunks, enabling operations like counting, summing, and averaging over defined time periods.

Window Types

Apache Flink supports several window types, each suited to different analytical needs:

Tumbling Windows

Tumbling windows are fixed-size, non-overlapping time intervals. Every event belongs to exactly one window.

  • Example: 5-minute tumbling windows create buckets [00:00-00:05), [00:05-00:10), etc.
  • Use case: periodic reporting, hourly hit counts.
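
The bucket boundaries above follow from simple modular arithmetic on the event timestamp. The following is a minimal sketch in plain Python (not Flink API); the function name `tumbling_window` and the epoch alignment are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple


def tumbling_window(ts: datetime, size: timedelta) -> Tuple[datetime, datetime]:
    """Return the [start, end) tumbling window that contains ts.

    Windows are aligned to the Unix epoch (an assumption of this sketch).
    """
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch) % size  # how far ts lies past its window start
    start = ts - offset
    return start, start + size


# An event at 00:07:30 lands in the single bucket [00:05, 00:10).
start, end = tumbling_window(datetime(2024, 1, 1, 0, 7, 30), timedelta(minutes=5))
print(start.time(), end.time())
```

Because the start is derived from the timestamp alone, every event maps to exactly one window, which is the defining property of tumbling windows.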

Sliding Windows

Sliding windows have a fixed size but advance by a configurable slide interval. Windows may overlap, meaning a single event can belong to multiple windows.

  • Example: 10-minute windows sliding every 5 minutes, so each event falls into two overlapping windows.
  • Use case: moving averages, trend detection.
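
To see why one event can belong to several sliding windows, the sketch below enumerates every window that contains a given timestamp. It is plain Python rather than Flink API, and the name `sliding_windows` and the epoch alignment are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def sliding_windows(
    ts: datetime, size: timedelta, slide: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Return all [start, end) sliding windows that contain ts, oldest first."""
    epoch = datetime(1970, 1, 1)
    # Latest window start at or before ts (starts are multiples of the slide).
    start = ts - (ts - epoch) % slide
    windows = []
    # A window [start, start + size) contains ts while start > ts - size.
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# 10-minute windows sliding every 5 minutes: the event falls into two windows.
for w_start, w_end in sliding_windows(
    datetime(2024, 1, 1, 0, 7, 30), timedelta(minutes=10), timedelta(minutes=5)
):
    print(w_start.time(), w_end.time())
```

In general an event belongs to size / slide windows when the size is a multiple of the slide, so a small slide relative to the size multiplies the state that has to be kept.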

Session Windows

Session windows are defined by a gap of inactivity. A window closes when no new events arrive within the specified gap duration.

  • Example: session window with a 30-minute gap.
  • Use case: user session analysis, activity tracking.
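
Session boundaries depend on the data rather than on the clock. The sketch below groups timestamps for a single key into sessions in plain Python; it processes a finite, sorted list, which is a simplification of what a streaming engine does incrementally, and the name `session_windows` is an assumption for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def session_windows(timestamps: Iterable[datetime], gap: timedelta) -> List[List[datetime]]:
    """Group event timestamps into sessions separated by at least `gap` of inactivity."""
    sessions: List[List[datetime]] = []
    for ts in sorted(timestamps):
        # Extend the current session if the event arrives before the gap elapses.
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


events = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 1, 9, 35),
    datetime(2024, 1, 1, 10, 20),  # 45 minutes after the previous event
]
sessions = session_windows(events, timedelta(minutes=30))
print(len(sessions))  # 2
```

Unlike tumbling and sliding windows, the resulting sessions have no fixed length: the first session here spans 35 minutes of activity and the second contains a single event.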

Watermarks for Handling Late Data

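In streaming systems, events may arrive out of order or late. A watermark is a marker in the stream that carries a timestamp and tells Flink that no further events with an earlier timestamp are expected. In Flink SQL, the watermark strategy is declared on the event-time column in the table definition: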

event_timestamp AS TO_TIMESTAMP(event_time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS'),
WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '15' SECOND

Key concepts:

  • A watermark of event_timestamp - INTERVAL '15' SECOND trails the largest event timestamp seen so far by 15 seconds, so events that arrive up to 15 seconds out of order are still included in the correct window.
  • An event is late when it arrives after the watermark has already passed the end of its window. Late events are typically dropped (the DataStream API can instead route them to a side output).
  • The watermark delay (15 seconds here) represents a trade-off between completeness (allowing more late data) and latency (emitting results sooner).
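
The interaction between watermark delay, window firing, and late events can be traced with a small simulation. This is a simplified sketch in plain Python, not Flink's implementation: timestamps are integer seconds, windows are tumbling, and the function name `windowed_counts` and its return shape are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def windowed_counts(
    events: Iterable[int], window_size: int, max_delay: int
) -> Tuple[Dict[int, int], Dict[int, int], List[int]]:
    """Count events per tumbling window, firing a window once the watermark passes its end.

    `events` are event-time seconds in arrival order.
    Returns (fired windows, still-open windows, dropped late events).
    """
    open_counts: Dict[int, int] = defaultdict(int)  # window start -> running count
    fired: Dict[int, int] = {}                      # window start -> final count
    dropped: List[int] = []
    watermark = float("-inf")
    for ts in events:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            dropped.append(ts)  # the window already fired: the event is late
        else:
            open_counts[start] += 1
        # The watermark trails the largest timestamp seen by max_delay.
        watermark = max(watermark, ts - max_delay)
        for s in sorted(open_counts):
            if s + window_size <= watermark:
                fired[s] = open_counts.pop(s)
    return fired, dict(open_counts), dropped


# 60-second windows, 15-second delay. The event at 58 arrives out of order but in
# time; the event at 40 arrives after window [0, 60) has fired and is dropped.
fired, still_open, dropped = windowed_counts([10, 50, 55, 70, 58, 80, 40], 60, 15)
print(fired, still_open, dropped)  # {0: 4} {60: 2} [40]
```

Raising `max_delay` would have kept window [0, 60) open long enough to count the event at 40, at the cost of emitting that window's result later.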

Multiple Aggregation Pipelines

A single Kafka source can feed multiple aggregation pipelines simultaneously. Each pipeline:

  • Applies its own window definition.
  • Groups by different dimensions (e.g., host only, or host + referrer).
  • Writes to a separate sink table.

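This fan-out pattern allows one source table definition to produce multiple analytical views. Whether the topic is physically read only once depends on how the pipelines are submitted: Flink can share the source scan when the INSERT statements run in the same job (for example, in a statement set), whereas separately submitted jobs each consume the topic on their own.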
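
The fan-out idea can be sketched without any streaming engine: each event is read once and routed to several independent aggregations. The example below is plain Python with hypothetical field names (`ts`, `host`, `referrer`), hypothetical host values, and a hypothetical function `fan_out`; in a real deployment each counter would correspond to a separate sink table.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def fan_out(
    events: Iterable[Dict], host_window: int, referrer_window: int
) -> Tuple[Counter, Counter]:
    """Feed each event into two tumbling-window aggregations in one pass.

    Timestamps and window sizes are in seconds.
    """
    by_host: Counter = Counter()           # keyed by (window start, host)
    by_host_referrer: Counter = Counter()  # keyed by (window start, host, referrer)
    for e in events:
        host_start = e["ts"] - e["ts"] % host_window
        ref_start = e["ts"] - e["ts"] % referrer_window
        by_host[(host_start, e["host"])] += 1
        by_host_referrer[(ref_start, e["host"], e["referrer"])] += 1
    return by_host, by_host_referrer


events = [
    {"ts": 30, "host": "a.example", "referrer": "x"},
    {"ts": 200, "host": "a.example", "referrer": "y"},
    {"ts": 400, "host": "a.example", "referrer": "x"},
    {"ts": 100, "host": "b.example", "referrer": "x"},
]
# 5-minute windows per host, 1-hour windows per host and referrer.
by_host, by_host_referrer = fan_out(events, 300, 3600)
print(by_host[(0, "a.example")], by_host_referrer[(0, "a.example", "x")])  # 2 2
```

Each aggregation keeps its own window size and grouping key, so adding another view means adding another counter rather than another read of the input.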

When to Use

Use Windowed Aggregation when:

  • You are computing time-bucketed metrics from streaming data (e.g., hits per hour, events per 5 minutes).
  • You need to handle out-of-order or late-arriving events via watermarks.
  • You need multiple aggregation views from the same source stream.
