Principle: DataTalksClub Data Engineering Zoomcamp Streaming Data Model
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub/data-engineering-zoomcamp (07-streaming) |
| Domains | Data_Engineering, Stream_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Defining strongly-typed data models for stream processing enforces schema contracts between producers and consumers at the application layer, ensuring reliable serialization and deserialization of domain events.
Description
In a streaming architecture, messages flow between producers and consumers as serialized bytes. Without a well-defined data model, each side of the pipeline must independently guess the structure of the data, leading to brittle integrations and runtime failures.
A streaming data model serves several critical purposes:
- Schema enforcement: By defining a class or struct with typed fields, the application catches type mismatches at construction time rather than at query time downstream.
- Serialization contract: The data model specifies how raw input (CSV rows, JSON objects, Avro records) maps to structured domain objects. This contract is shared between producer and consumer code.
- Deserialization symmetry: A well-designed model provides both a primary constructor (from raw arrays or positional data) and an alternate constructor (from dictionaries or named fields), enabling round-trip serialization through different formats.
- Domain semantics: Field names like pickup_datetime, passenger_count, and fare_amount carry domain meaning that raw arrays or generic dictionaries lack. This makes the code self-documenting and reduces errors.
The data model typically handles type coercion during construction -- converting string representations to integers, decimals, and timestamps -- so that all downstream processing operates on correctly typed values.
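A minimal sketch of construction-time coercion in Python. The class name, field names, and timestamp format are illustrative (loosely following the taxi-ride fields mentioned above), not taken from the course code:

```python
from datetime import datetime
from decimal import Decimal

class RideRecord:
    """Hypothetical model: coerce raw CSV strings to typed fields at construction."""
    def __init__(self, arr):
        self.vendor_id = str(arr[0])
        self.pickup_datetime = datetime.strptime(arr[1], "%Y-%m-%d %H:%M:%S")
        self.passenger_count = int(arr[2])          # fails fast on non-numeric input
        self.fare_amount = Decimal(arr[3])          # exact decimal, not float

# One CSV row becomes a typed object; bad input raises here, not downstream.
ride = RideRecord("2,2024-01-15 09:30:00,1,14.50".split(","))
```

Because coercion happens in `__init__`, a malformed row raises `ValueError` at ingestion time rather than surfacing as a type error in a downstream query.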
Usage
Use this principle when:
- Your streaming pipeline ingests semi-structured data (CSV, JSON) that must be parsed into typed objects.
- Multiple components (producers, consumers, stream processors) share the same message format.
- You need to support multiple serialization formats (JSON, Avro) from the same domain model.
- You want compile-time or construction-time validation of message fields.
Theoretical Basis
A streaming data model defines a mapping function from raw input to a typed domain object:
CLASS DomainEvent:
    CONSTRUCTOR(raw_fields: Array[String]):
        field_1 = parse_as_string(raw_fields[0])
        field_2 = parse_as_datetime(raw_fields[1], format="%Y-%m-%d %H:%M:%S")
        field_3 = parse_as_integer(raw_fields[2])
        field_4 = parse_as_decimal(raw_fields[3])
        ...

    CLASSMETHOD from_dict(dictionary: Map[String, Any]) -> DomainEvent:
        raw = extract_ordered_values(dictionary)
        RETURN CONSTRUCTOR(raw)

    METHOD to_dict() -> Map[String, Any]:
        RETURN {field_name: field_value FOR EACH field IN self}
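The pseudocode above can be realized in Python roughly as follows. This is a sketch under the same illustrative field names as before, not the course's exact implementation; `to_dict` renders every field as a string so the output survives JSON encoding and round-trips through `from_dict`:

```python
from datetime import datetime
from decimal import Decimal

class RideRecord:
    # Primary constructor: positional data, e.g. a parsed CSV row.
    def __init__(self, arr):
        self.vendor_id = str(arr[0])
        self.pickup_datetime = datetime.strptime(arr[1], "%Y-%m-%d %H:%M:%S")
        self.passenger_count = int(arr[2])
        self.fare_amount = Decimal(arr[3])

    # Alternate constructor: named data, e.g. a decoded JSON object.
    @classmethod
    def from_dict(cls, d):
        return cls([d["vendor_id"], d["pickup_datetime"],
                    d["passenger_count"], d["fare_amount"]])

    # Introspect the instance fields; stringify so the dict is JSON-safe
    # and parseable again by the primary constructor.
    def to_dict(self):
        return {name: str(value) for name, value in self.__dict__.items()}
```

Round-tripping a record through `to_dict` and `from_dict` yields an equal object, which is the deserialization symmetry the pattern is after.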
The design principles behind this pattern are:
- Single source of truth: One class defines all fields, their types, and their parsing logic. Both producer serialization and consumer deserialization reference the same model.
- Positional and named access: The primary constructor accepts positional data (for CSV parsing), while the alternate constructor accepts named data (for JSON deserialization). This dual interface supports multiple input formats.
- Immutability at construction: All type coercion happens in the constructor. Once the object is created, its fields are correctly typed and can be used without further validation.
- Introspection for serialization: The model exposes its fields as a dictionary (via __dict__ or equivalent), enabling generic serializers to convert the object to JSON or other formats without field-by-field mapping.
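As a small illustration of the introspection point, a generic serializer can handle any such model without knowing its fields. The `Event` class here is a made-up example; `default=str` covers field types (such as datetimes) that the JSON encoder cannot handle natively:

```python
import json

class Event:
    def __init__(self, arr):
        self.name = str(arr[0])
        self.count = int(arr[1])

def serialize(obj):
    # Generic: works for any model exposing its fields via __dict__,
    # with no field-by-field mapping code per message type.
    return json.dumps(obj.__dict__, default=str)

payload = serialize(Event(["click", "3"]))
```

The same `serialize` function can be reused by a producer for every message type that follows the pattern.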