Principle:DataTalksClub Data engineering zoomcamp Kafka Streams Topology

From Leeroopedia
Revision as of 17:29, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/DataTalksClub_Data_engineering_zoomcamp_Kafka_Streams_Topology.md)


Knowledge Sources
Domains: Streaming, Kafka
Last Updated: 2026-02-09 00:00 GMT

Overview

A declarative processing graph that defines how streaming records flow from input topics through transformations to output topics.

Description

A Kafka Streams topology is the central abstraction for defining stream processing logic. It represents a directed acyclic graph (DAG) of processing nodes -- called processors -- through which records flow continuously. The topology is composed of three types of processors:

  • Source processors: These are the entry points of the topology. Each source processor is bound to one or more input topics and emits records into the processing graph whenever new messages arrive on those topics.
  • Stream processors: These are intermediate nodes that receive records from upstream processors, apply transformations (filtering, mapping, branching, aggregating, joining), and forward the results downstream. Stream processors form the core computation logic.
  • Sink processors: These are the exit points of the topology. Each sink processor writes processed records to an output topic, making results available to downstream consumers or other topologies.
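The three processor roles can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: Python lists stand in for topics, and `run_topology`, `drop_empty`, `double_count`, and the sample ride records are hypothetical names invented for illustration, not part of the Kafka Streams API.

```python
from typing import Callable, Iterable, Optional

Record = tuple[str, int]  # (key, value)


def run_topology(
    source: Iterable[Record],
    processors: list[Callable[[Record], Optional[Record]]],
    sink: list[Record],
) -> None:
    """Push every source record through the processors, then into the sink."""
    for record in source:                  # source processor: emits records
        current: Optional[Record] = record
        for process in processors:         # stream processors: transform or drop
            current = process(current)
            if current is None:            # None means the record was filtered out
                break
        if current is not None:
            sink.append(current)           # sink processor: writes to the output "topic"


def drop_empty(rec: Record) -> Optional[Record]:
    """Stateless filter: keep only records with a positive value."""
    return rec if rec[1] > 0 else None


def double_count(rec: Record) -> Record:
    """Stateless map: double the value, keep the key."""
    return (rec[0], rec[1] * 2)


# Hypothetical input topic contents: (ride_id, passenger_count)
input_topic = [("r1", 2), ("r2", 0), ("r3", 5)]
output_topic: list[Record] = []

run_topology(input_topic, [drop_empty, double_count], output_topic)
print(output_topic)
```

A real topology processes records continuously and in parallel across partitions; this model is linear and single-threaded, and only shows how a record moves from source to sink.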

The topology is defined declaratively at application startup. Once built, the framework handles partitioning, parallelism, fault tolerance, and state management transparently. Key transformation patterns include:

  • GroupBy and aggregation: Records are repartitioned by a chosen key and then reduced (counted, summed, or combined) into a running aggregate. This requires an internal state store to maintain intermediate results across incoming records.
  • Stream-stream joins: Two input streams are joined based on a shared key within a configurable time window. Both sides of the join are buffered in state stores, and matching pairs are emitted when a record from one side finds its counterpart in the other.
  • Windowed aggregation: Records are grouped into time-bounded windows (tumbling, hopping, or session) before aggregation. This allows computing metrics over fixed time intervals such as counts per minute or averages per hour.

The topology abstraction decouples the what (processing logic) from the how (execution, scaling, recovery), enabling the same logical topology to run on a single thread or across a distributed cluster of instances.

Usage

Use this principle when:

  • You need to process an unbounded stream of records in near real-time.
  • Your processing logic can be expressed as a sequence of stateless transformations (map, filter) and/or stateful operations (aggregation, join, windowing).
  • You want to consume from and produce to Kafka topics within a single application.
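  • You require automatic state management and fault tolerance, optionally with exactly-once processing semantics (exactly-once must be enabled through the processing guarantee configuration; the default is at-least-once).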
  • You need to combine multiple input streams or enrich records by joining them against another stream.

Theoretical Basis

A stream processing topology can be modeled as a DAG G = (V, E) where vertices V are processors and edges E define the data flow between them.

Core topology structure:

TOPOLOGY = DAG(
    sources:     { S1 -> Topic_A, S2 -> Topic_B },
    processors:  { P1, P2, ..., Pn },
    sinks:       { K1 -> Topic_C }
)

DATA FLOW:
    S1 --> P1 --> P2 --> K1
    S2 --> P3 ----^
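As a rough check of the DAG property, the data flow above can be written as a predecessor map and sorted topologically. This sketch assumes Python 3.9+ for the standard-library `graphlib` module; the helper `is_acyclic` is a hypothetical name introduced here.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of nodes it receives records from.
# Node names mirror the data-flow diagram above.
predecessors = {
    "P1": {"S1"},
    "P2": {"P1", "P3"},
    "P3": {"S2"},
    "K1": {"P2"},
}


def is_acyclic(preds: dict) -> bool:
    """Return True if the graph described by `preds` has no cycle."""
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


# One valid processing order: every node appears after all of its predecessors.
order = list(TopologicalSorter(predecessors).static_order())
print(order)
print(is_acyclic(predecessors))
```

Adding an edge from K1 back to S1 would make `is_acyclic` return False, which is the kind of structure a topology must not contain.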

GroupBy aggregation pattern:

FUNCTION groupby_count(input_stream):
    grouped = input_stream
        .GROUP_BY(record -> extract_key(record))
        .COUNT()

    -- Internally:
    -- 1. Repartition records by the new key (may produce to internal topic)
    -- 2. For each incoming record with key K:
    --        state_store[K] = state_store[K] + 1
    -- 3. Emit (K, state_store[K]) downstream

    RETURN grouped
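The internal steps of the count can be mimicked with a plain dictionary acting as the state store. This is a minimal sketch, not the Kafka Streams implementation: repartitioning is skipped because everything runs in one process, and `groupby_count` and the sample `rides` records are assumptions made for illustration.

```python
from collections import defaultdict


def groupby_count(records, extract_key):
    """Count records per extracted key, emitting one update per input record."""
    state_store = defaultdict(int)   # stands in for the local state store
    updates = []                     # stands in for the downstream output
    for record in records:
        key = extract_key(record)
        state_store[key] += 1        # missing keys start at 0
        updates.append((key, state_store[key]))
    return updates, dict(state_store)


# Hypothetical input: ride records keyed by pickup zone.
rides = [{"zone": "A"}, {"zone": "B"}, {"zone": "A"}]
updates, final_counts = groupby_count(rides, lambda r: r["zone"])
print(updates)
print(final_counts)
```

The output is a stream of updates rather than a single final value: each incoming record produces a new running count for its key. A real deployment may cache or batch these updates, so the exact sequence seen downstream can differ.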

Stream-stream join pattern:

FUNCTION stream_join(left_stream, right_stream, window_duration):
    joined = left_stream.JOIN(
        right_stream,
        join_condition = (left.key == right.key),
        within         = window_duration,
        combiner       = FUNCTION(left_value, right_value):
                             RETURN merge(left_value, right_value)
    )

    -- Internally:
    -- 1. Buffer left records in state_store_L with expiry = window_duration
    -- 2. Buffer right records in state_store_R with expiry = window_duration
    -- 3. When record L arrives on left_stream with key K:
    --        FOR EACH matching record M in state_store_R WHERE M.key == K
    --            AND M.timestamp within [L.timestamp - window_duration, L.timestamp + window_duration]:
    --                EMIT combiner(L.value, M.value)
    -- 4. Symmetric logic for records arriving on right_stream

    RETURN joined
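The buffering and matching logic can be approximated in memory as follows. This is a sketch under simplifying assumptions: both streams are merged into one arrival-ordered list, buffered records never expire, and the function name `stream_join`, the event tuple layout, and the sample events are all hypothetical.

```python
from collections import defaultdict


def stream_join(events, window, combiner):
    """Join two streams on key within a symmetric time window.

    events: iterable of (side, key, timestamp, value) in arrival order,
            where side is "L" (left stream) or "R" (right stream).
    """
    # One buffer per side, keyed by join key (stands in for the two state stores).
    buffers = {"L": defaultdict(list), "R": defaultdict(list)}
    out = []
    for side, key, ts, value in events:
        other = "R" if side == "L" else "L"
        # Look for counterparts already buffered on the other side.
        for other_ts, other_value in buffers[other][key]:
            if abs(ts - other_ts) <= window:
                # Always pass the left value first, whichever side arrived last.
                if side == "L":
                    left, right = value, other_value
                else:
                    left, right = other_value, value
                out.append((key, combiner(left, right)))
        buffers[side][key].append((ts, value))
    return out


events = [
    ("L", "ride-1", 100, "pickup"),
    ("R", "ride-1", 103, "payment"),   # within 5 of the left record: joins
    ("R", "ride-2", 104, "payment"),
    ("L", "ride-2", 120, "pickup"),    # 16 apart: outside the window, no join
]
joined = stream_join(events, window=5, combiner=lambda lv, rv: f"{lv}+{rv}")
print(joined)
```

Because expiry is left out, the buffers here grow without bound; a production join discards buffered records once they can no longer match.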

Windowed aggregation pattern:

FUNCTION windowed_count(input_stream, window_size, grace_period):
    windowed = input_stream
        .GROUP_BY_KEY()
        .WINDOW_BY(tumbling_window(size = window_size, grace = grace_period))
        .COUNT()

    -- Internally:
    -- 1. Assign each record to a window based on its event timestamp:
    --        window_start = floor(record.timestamp / window_size) * window_size
    --        window_end   = window_start + window_size
    -- 2. Composite key = (record.key, window_start, window_end)
    -- 3. state_store[composite_key] = state_store[composite_key] + 1
    -- 4. After window_end + grace_period, the window is closed and
    --    late-arriving records are discarded

    RETURN windowed
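The window assignment and grace-period rule can be traced with a small simulation. It is a sketch under assumptions: timestamps are non-negative integers, "stream time" is modeled as the highest timestamp seen so far, and a record is dropped once stream time has reached `window_end + grace_period`. The function and variable names are hypothetical.

```python
from collections import defaultdict


def windowed_count(records, window_size, grace_period):
    """Count records per (key, tumbling window), dropping records that arrive too late.

    records: iterable of (key, timestamp) in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    stream_time = 0  # highest event timestamp observed so far
    for key, ts in records:
        stream_time = max(stream_time, ts)
        window_start = (ts // window_size) * window_size
        window_end = window_start + window_size
        if stream_time >= window_end + grace_period:
            dropped.append((key, ts))   # window already closed
            continue
        counts[(key, window_start, window_end)] += 1
    return dict(counts), dropped


# Timestamp 8 arrives out of order but within the grace period;
# timestamp 9 arrives after stream time has advanced to 31 and is discarded.
records = [("A", 5), ("A", 12), ("A", 8), ("A", 31), ("A", 9)]
counts, dropped = windowed_count(records, window_size=10, grace_period=5)
print(counts)
print(dropped)
```

The grace period trades completeness for latency: a longer grace accepts more out-of-order records but keeps window state around for longer.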

The key theoretical properties of a well-formed topology are:

  1. Deterministic processing: Given the same input records in the same order, the topology produces the same output, enabling reproducible results and testability.
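  2. Partition-based parallelism: Each input partition is assigned to exactly one stream task, so the number of input partitions caps the achievable parallelism. Scaling is achieved by adding stream threads or application instances up to that cap, or by increasing the partition count.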
  3. State locality: Stateful operations (aggregations, joins) maintain their state in local stores co-located with the processing thread, avoiding network round-trips for state access.
  4. Changelog-backed recovery: State stores are backed by changelog topics. On failure, state is rebuilt by replaying the changelog, so state updates that were written to the changelog are not lost.
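Changelog-backed recovery can be sketched with a list standing in for the changelog topic. This is an illustrative model only: `ChangelogBackedStore` is a hypothetical class, and the Python list is assumed to be durable in the way a Kafka topic would be.

```python
class ChangelogBackedStore:
    """Key-value store that records every update in an append-only changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a durable changelog topic
        self.data = {}              # local state, lost if the instance fails

    def put(self, key, value):
        self.changelog.append((key, value))  # log the update
        self.data[key] = value               # then apply it locally

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state by replaying the changelog from the beginning."""
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value          # later entries overwrite earlier ones
        return store


changelog = []
store = ChangelogBackedStore(changelog)
for zone in ["A", "B", "A"]:
    store.put(zone, store.data.get(zone, 0) + 1)

# Simulate a failure: the local store is gone, only the changelog survives.
del store
recovered = ChangelogBackedStore.restore(changelog)
print(recovered.data)
```

Since only the latest value per key matters for recovery, a changelog of this shape lends itself to compaction, which keeps replay time bounded.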
