Principle:DataTalksClub Data engineering zoomcamp Kafka Custom Serialization

From Leeroopedia


Knowledge Sources
Domains: Streaming, Kafka, Serialization
Last Updated: 2026-02-09 00:00 GMT

Overview

Custom encoding and decoding strategies for converting domain objects to and from byte streams for Kafka transport.

Description

Kafka transmits all data as raw byte arrays. To send and receive structured domain objects (such as ride events, user profiles, or sensor readings), applications must define serializers that convert objects to bytes and deserializers that convert bytes back to objects. Together, a matched serializer-deserializer pair is called a Serde (Serializer/Deserializer).

Custom serialization is necessary when the built-in string and integer serializers are insufficient for the application's data model. Two dominant encoding strategies exist:

  • JSON serialization: The domain object is converted to a JSON string and then encoded as UTF-8 bytes. This approach is schema-less -- the structure of the data is implicit in the JSON document itself. JSON is human-readable, widely supported, and easy to debug, but it carries field names in every message, increasing payload size. No external schema registry is required, making it simpler to set up.
  • Avro serialization: The domain object is encoded using Apache Avro, a compact binary format, paired here with an external schema registry that stores and resolves schemas. The serialized payload contains only the data values (no field names), prefixed with a magic byte and a 4-byte schema ID. The deserializer reads this ID, fetches the matching schema from the registry, and decodes the bytes accordingly. Avro supports schema evolution under backward, forward, or full compatibility rules, at the cost of additional infrastructure.
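
To make the payload-size difference concrete, the following minimal sketch encodes the same record twice: once as JSON and once as a values-only binary layout. The `ride` record and its field names are hypothetical, and Python's `struct` module is used only as a stand-in for a schema-based binary encoding; it is not Avro and does not reproduce Avro's actual byte layout.

```python
import json
import struct

# Hypothetical ride event, used only for illustration.
ride = {"vendor_id": 2, "passenger_count": 1, "trip_distance": 3.75}

# JSON: the field names travel inside every message.
json_payload = json.dumps(ride).encode("utf-8")

# Values-only binary stand-in (NOT Avro): the fixed layout plays the role
# of the schema. ">iid" = two 32-bit ints and one 64-bit float, big-endian.
LAYOUT = ">iid"
binary_payload = struct.pack(
    LAYOUT, ride["vendor_id"], ride["passenger_count"], ride["trip_distance"]
)

# Without the layout (the "schema"), the binary payload cannot be interpreted.
decoded = struct.unpack(LAYOUT, binary_payload)

print(len(json_payload), len(binary_payload))
```

The binary form is smaller because the reader already knows the field order and types, which is the same reason a schema-based format needs the schema to be available at read time.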

The Serde abstraction encapsulates both directions of conversion behind a unified interface, allowing stream processing frameworks to transparently handle serialization at topology boundaries (source and sink processors) and at internal repartition points.

Usage

Use this principle when:

  • Your stream processing application needs to read or write structured domain objects beyond simple strings or integers.
  • You want a reusable, pluggable serialization layer that can be swapped between JSON and Avro without changing processing logic.
  • You need schema evolution support for messages that change structure over time (prefer Avro).
  • You need rapid development and human-readable messages for debugging (prefer JSON).
  • You are building custom Serde instances to register with a stream processing topology.

Theoretical Basis

Formally, a Serde is a pair of functions in which the deserializer inverts the serializer:

DEFINITION Serde<T>:
    serializer:   T -> byte[]
    deserializer: byte[] -> T

INVARIANT:
    FOR ALL objects x of type T:
        deserializer(serializer(x)) == x

JSON Serde construction:

FUNCTION create_json_serde(target_class):

    FUNCTION serialize(topic, object):
        json_string = convert_to_json(object)
        RETURN encode_utf8(json_string)

    FUNCTION deserialize(topic, bytes):
        json_string = decode_utf8(bytes)
        RETURN parse_json(json_string, target_class)

    RETURN Serde(
        serializer   = serialize,
        deserializer = deserialize
    )
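
The JSON construction above can be sketched in Python using only the standard library. This is an illustrative sketch, not the course's implementation: the `Serde` container, the `Ride` dataclass, and the topic name `"rides"` are all hypothetical, and the target class is assumed to be a dataclass with flat, JSON-friendly fields.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Serde(Generic[T]):
    """A matched serializer/deserializer pair (hypothetical helper type)."""
    serializer: Callable[[str, T], bytes]
    deserializer: Callable[[str, bytes], T]


@dataclass(frozen=True)
class Ride:
    """Hypothetical domain object with flat, JSON-friendly fields."""
    ride_id: str
    passenger_count: int
    trip_distance: float


def create_json_serde(target_class):
    def serialize(topic: str, obj) -> bytes:
        # Assumes target_class is a dataclass, so asdict() yields a plain dict.
        return json.dumps(asdict(obj)).encode("utf-8")

    def deserialize(topic: str, data: bytes):
        # Rebuild the object from the decoded JSON fields.
        return target_class(**json.loads(data.decode("utf-8")))

    return Serde(serializer=serialize, deserializer=deserialize)


ride_serde = create_json_serde(Ride)
original = Ride(ride_id="r-1", passenger_count=2, trip_distance=3.75)
payload = ride_serde.serializer("rides", original)
restored = ride_serde.deserializer("rides", payload)

# Round-trip invariant: deserializer(serializer(x)) == x
print(restored == original)
```

The `topic` argument is unused here, as in the pseudocode; it is kept so the functions match the serializer and deserializer signatures shown above.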

Avro Serde construction:

FUNCTION create_avro_serde(schema_registry_url, target_class):

    FUNCTION serialize(topic, object):
        schema = lookup_or_register_schema(schema_registry_url, topic, target_class)
        avro_bytes = encode_avro(object, schema)
        -- Payload format: [magic_byte][schema_id (4 bytes)][avro_bytes]
        RETURN concat(MAGIC_BYTE, int_to_bytes(schema.id), avro_bytes)

    FUNCTION deserialize(topic, bytes):
        magic_byte  = bytes[0]
        IF magic_byte != MAGIC_BYTE:
            RAISE error("unknown payload format")
        schema_id   = bytes_to_int(bytes[1..4])   -- 4 bytes, indices 1 through 4 inclusive
        avro_bytes  = bytes[5..]
        schema      = fetch_schema(schema_registry_url, schema_id)
        RETURN decode_avro(avro_bytes, schema, target_class)

    RETURN Serde(
        serializer   = serialize,
        deserializer = deserialize
    )
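
The framing step of the Avro construction (magic byte, 4-byte schema ID, then the Avro body) can be sketched with the standard library alone. The sketch below covers only the 5-byte header: the Avro encoding itself and the registry lookup are left out, so `avro_bytes` is treated as an opaque byte string. The function names `frame` and `unframe` are hypothetical, and the schema ID is assumed to be stored big-endian with a magic byte of 0.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0
# 1 unsigned byte (magic) followed by a 4-byte unsigned big-endian schema ID.
HEADER = struct.Struct(">BI")


def frame(schema_id: int, avro_bytes: bytes) -> bytes:
    """Prefix an already-encoded Avro body with the 5-byte header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_bytes


def unframe(data: bytes) -> Tuple[int, bytes]:
    """Split a payload into (schema_id, avro_bytes), validating the header."""
    if len(data) < HEADER.size:
        raise ValueError("payload is shorter than the 5-byte header")
    magic, schema_id = HEADER.unpack_from(data)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, data[HEADER.size:]


body = b"\x02\x06abc"  # placeholder bytes standing in for an Avro-encoded record
framed = frame(42, body)
schema_id, recovered = unframe(framed)
print(schema_id, recovered == body)
```

Checking the magic byte before reading the schema ID lets a consumer fail fast on payloads that were not produced by a registry-aware serializer, for example plain JSON written to the same topic.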

Serde registration in a topology:

FUNCTION configure_topology(config):
    -- Register default Serde for keys and values
    config.set("default.key.serde",   create_json_serde(StringKey))
    config.set("default.value.serde", create_json_serde(DomainObject))

    -- Or override per-operator:
    stream = topology
        .source(topic, key_serde = StringSerde, value_serde = custom_json_serde)
        .groupBy(new_key_extractor, grouped_serde = custom_json_serde)
        .count(materialized_with = LongSerde)

The key theoretical properties of custom serialization are:

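  1. Round-trip correctness (injectivity): A correct Serde guarantees that serialization followed by deserialization yields the original object, which requires the serializer to map distinct objects to distinct byte sequences. The pair need not be a full bijection, because not every byte sequence is a valid encoding. Any violation of this invariant (e.g., precision loss in floating-point fields, timezone mishandling in timestamps) introduces silent data corruption.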
  2. Schema coupling: JSON Serdes are loosely coupled to the data schema -- any valid JSON can be deserialized if the target class can accommodate it. Avro Serdes are tightly coupled through the schema registry: incompatible schema changes can be rejected when a schema is registered rather than discovered by consumers at read time, at the cost of a dependency on the registry.
  3. Performance trade-off: JSON payloads are larger (field names repeated per message) but require no external infrastructure. Avro payloads are compact (data only, schema resolved via registry) but require a running schema registry and a network lookup whenever the deserializer meets a schema ID it has not yet resolved; clients typically cache schemas after the first lookup.
  4. Composability: Serdes are composable with stream processing operators. A topology can use different Serdes at different stages -- for example, JSON for the input topic, a custom Serde for internal repartition topics, and Avro for the output topic.
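
The round-trip property in item 1 is easy to break without any error being raised. The following sketch, using only the standard library, shows three cases: two where plain JSON changes the shape of the data, and one where a timestamp loses its UTC offset because of the chosen string format. The values and the `FORMAT` string are hypothetical examples, not taken from the course code.

```python
import json
from datetime import datetime, timedelta, timezone


def json_round_trip(value):
    """Encode to UTF-8 JSON bytes and decode again, as a JSON Serde would."""
    return json.loads(json.dumps(value).encode("utf-8").decode("utf-8"))


# Case 1: tuples come back as lists.
with_tuple = {"coords": (40.7, -74.0)}
tuple_survives = json_round_trip(with_tuple) == with_tuple

# Case 2: integer dictionary keys come back as strings.
with_int_key = {1: "pickup"}
int_key_survives = json_round_trip(with_int_key) == with_int_key

# Case 3: a timestamp format without an offset silently drops the timezone.
pickup = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=-5)))
FORMAT = "%Y-%m-%d %H:%M:%S"  # hypothetical format with no UTC offset
lossy = datetime.strptime(pickup.strftime(FORMAT), FORMAT)

# ISO 8601 with an explicit offset preserves the original instant.
faithful = datetime.fromisoformat(pickup.isoformat())

print(tuple_survives, int_key_survives, lossy == pickup, faithful == pickup)
```

A round-trip assertion of this kind in a unit test, run over representative domain objects, is one inexpensive way to catch such violations before they reach a topic.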
