Principle:DataTalksClub Data engineering zoomcamp Kafka Producer Pattern
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub/data-engineering-zoomcamp (07-streaming) |
| Domains | Data_Engineering, Stream_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Publishing messages to a distributed log with key-based partitioning enables producers to serialize domain objects and route related events to the same partition for ordered processing.
Description
The Kafka producer pattern is the entry point for data into a streaming pipeline. A producer takes domain objects, serializes them into bytes, and sends them to a named topic on the broker. The pattern involves several key concepts:
- Topic selection: The producer writes to a specific topic, which acts as a named, append-only log. Topics are the primary organizational unit in Kafka.
- Key-based partitioning: Each message can carry a key. Kafka hashes the key to determine which partition receives the message. All messages with the same key land on the same partition, which guarantees ordering for that key. This is critical when downstream consumers need to process events for a given entity in sequence.
- Serialization: The producer must convert both the message key and the message value from application-level objects to byte arrays. Common strategies include encoding keys as UTF-8 strings and values as JSON. Custom serializers are configured at producer initialization.
- Asynchronous sends with confirmation: The `send()` call is typically asynchronous, returning a future. The producer can call `.get()` on the future to block until the broker confirms receipt, or it can register a callback for non-blocking confirmation.
- Error handling: Network issues, broker unavailability, and serialization failures must be caught and handled. Timeout errors are particularly common during broker restarts or network partitions.
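A minimal sketch of the two serialization strategies described above, written as standalone functions so they run without a broker. The function names and the record fields are hypothetical; the wiring shown in the follow-up note uses kafka-python's actual `key_serializer`/`value_serializer` parameters.

```python
import json

# Serializer functions in the shape kafka-python expects:
# callables that take an application-level object and return bytes.
def serialize_key(key):
    return key.encode("utf-8")              # keys as UTF-8 strings

def serialize_value(obj):
    return json.dumps(obj).encode("utf-8")  # values as JSON

# Hypothetical domain record; field names are illustrative only.
record = {"location_id": 42, "amount": 13.5}
key_bytes = serialize_key(str(record["location_id"]))
value_bytes = serialize_value(record)
```

At producer initialization these would be wired in as, for example, `KafkaProducer(bootstrap_servers="localhost:9092", key_serializer=serialize_key, value_serializer=serialize_value)` -- assuming the kafka-python client and a local broker.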
Usage
Use this principle when:
- You need to ingest batch data (CSV files, database extracts) into a streaming pipeline.
- You want to ensure that related events are co-located on the same partition for ordered processing.
- You need to transform domain objects into serialized messages for downstream consumption.
- You are building a bridge between a file-based or request-based data source and a real-time processing system.
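For the batch-ingestion case in the first bullet, the read step can be as small as wrapping `csv.DictReader`. A sketch, with hypothetical column names and an in-memory stand-in for the source file:

```python
import csv
import io

def read_source_data(source):
    """Yield one dict per CSV row -- each dict becomes a message value
    once JSON-serialized. `source` is any file-like object."""
    yield from csv.DictReader(source)

# Stand-in for an open CSV file; the columns are illustrative only.
sample = io.StringIO("location_id,amount\n42,13.5\n42,7.0\n")
rows = list(read_source_data(sample))
```

Yielding dicts rather than raw rows keeps the reader decoupled from the producer: anything downstream only needs mapping-shaped records.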
Theoretical Basis
The producer pattern follows a read-serialize-send loop:
```
FUNCTION produce_messages(source_path, topic, broker_config):
    producer = create_producer(
        bootstrap_servers = broker_config.servers,
        key_serializer    = FUNCTION(key): encode_as_string(key),
        value_serializer  = FUNCTION(obj): encode_as_json(obj.to_dict())
    )
    records = read_source_data(source_path)
    FOR EACH record IN records:
        partition_key = extract_partition_key(record)
        TRY:
            future = producer.send(
                topic = topic,
                key   = partition_key,
                value = record
            )
            confirmation = future.get()  -- block until broker acknowledges
            LOG "Record sent to offset {confirmation.offset}"
        CATCH TimeoutError:
            LOG "Broker unreachable, message not delivered"
    producer.flush()
    producer.close()
```
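A Python rendering of the loop above, sketched against a kafka-python-style producer. The producer object is passed in rather than constructed here, both to keep transport separate from the loop and so the function can be exercised without a live broker; `key_field` and the timeout value are assumptions, and serialization is assumed to be configured on the producer itself.

```python
def produce_messages(producer, records, topic, key_field):
    """Read-serialize-send loop: one synchronous, confirmed send per record.

    `producer` is any object with kafka-python's send/flush/close shape.
    Returns the number of records the broker acknowledged.
    """
    sent = 0
    for record in records:
        key = str(record[key_field])           # natural, domain-derived key
        try:
            future = producer.send(topic, key=key, value=record)
            metadata = future.get(timeout=10)  # block until the broker acks
            print(f"Record sent to offset {metadata.offset}")
            sent += 1
        except Exception:  # kafka-python raises KafkaTimeoutError here
            print("Broker unreachable, message not delivered")
    producer.flush()   # drain any messages still buffered client-side
    producer.close()
    return sent
```

With the real client, `producer` would be a `KafkaProducer(bootstrap_servers=..., key_serializer=..., value_serializer=...)` and `records` the output of the read step.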
The design principles of this pattern are:
- Separation of concerns: The producer class handles transport (connecting to brokers, sending bytes). Serialization is delegated to configurable serializer functions. The domain model is responsible for its own structure.
- Idempotent key selection: The partition key should be a natural identifier from the domain (e.g., a location ID or user ID). This ensures deterministic partitioning without maintaining external state.
- Backpressure via synchronous confirmation: Calling `.get()` on the send future provides natural backpressure -- the producer does not outrun the broker. For high-throughput scenarios, batching and asynchronous callbacks are preferred.
- Graceful degradation: Timeout handling prevents the producer from crashing when the broker is temporarily unavailable, allowing for retry logic or dead-letter queuing.
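The non-blocking alternative mentioned above registers callbacks on the send future instead of blocking on it. A sketch of that wiring, using a hypothetical stand-in future so it runs without a broker (kafka-python's real future exposes the same `add_callback`/`add_errback` methods, but resolves asynchronously):

```python
delivered, failed = [], []

def on_success(metadata):
    delivered.append(metadata.offset)  # confirmation without blocking

def on_error(exc):
    failed.append(exc)                 # hook for retry or dead-lettering

class StubFuture:
    """Stand-in for the future returned by send() -- illustration only."""
    def __init__(self, metadata=None, error=None):
        self.metadata, self.error = metadata, error
    def add_callback(self, fn):
        if self.error is None:
            fn(self.metadata)
        return self
    def add_errback(self, fn):
        if self.error is not None:
            fn(self.error)
        return self

class Meta:
    def __init__(self, offset):
        self.offset = offset

# In real code: future = producer.send(topic, key=key, value=record)
StubFuture(metadata=Meta(7)).add_callback(on_success).add_errback(on_error)
StubFuture(error=TimeoutError("broker down")).add_callback(on_success).add_errback(on_error)
```

The trade-off versus `.get()` is throughput for ordering of side effects: callbacks fire as acks arrive, so any per-record bookkeeping must tolerate out-of-order completion.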
Related Pages
- Implementation:DataTalksClub_Data_engineering_zoomcamp_JsonProducer_Implementation
- Principle:DataTalksClub_Data_engineering_zoomcamp_Streaming_Data_Model
- Principle:DataTalksClub_Data_engineering_zoomcamp_Kafka_Infrastructure_Setup
- Principle:DataTalksClub_Data_engineering_zoomcamp_Kafka_Consumer_Pattern