
Workflow:Apache Kafka Coordinator Runtime Lifecycle

From Leeroopedia


Knowledge Sources
Domains Distributed_Systems, State_Management
Last Updated 2026-02-09 12:00 GMT

Overview

End-to-end lifecycle of a coordinator partition within the Kafka CoordinatorRuntime framework, from partition loading through active operation processing to unloading.

Description

This workflow documents the complete lifecycle of a coordinator shard managed by the CoordinatorRuntime, Kafka's generic event-driven replicated state machine framework. The runtime maps each topic partition the broker leads to a CoordinatorShard instance that maintains the coordinator's replicated state. The lifecycle covers partition assignment, state loading from the log, active request processing (both reads and writes), snapshot management, metadata updates, and graceful or forced unloading. This framework underlies the group coordinator, transaction coordinator, and share coordinator.

Usage

Execute this workflow to understand how Kafka coordinator state machines operate at the partition level. This is relevant when debugging coordinator behavior, developing new coordinator types using the framework, or troubleshooting issues with consumer group management, transaction coordination, or share group coordination.

Execution Steps

Step 1: Partition Assignment and Context Creation

When the broker becomes leader for a coordinator topic partition, the CoordinatorRuntime creates a CoordinatorContext for that partition. The context transitions through states: INITIAL, LOADING, ACTIVE, FAILED, and CLOSED. A CoordinatorShardBuilder is obtained from the configured supplier and used to construct the shard with the appropriate timer, metadata image, and metrics shard.

Key considerations:

  • Each partition maps to exactly one CoordinatorShard instance
  • The CoordinatorContext tracks the partition's current state and epoch
  • The shard builder uses a supplier pattern for flexible construction
  • A SnapshotRegistry is created to support read-your-writes semantics within the shard
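The context's state machine can be sketched as follows. This is a minimal Python simulation, not Kafka's code: the class names (`MiniCoordinatorContext`) and the exact transition set are illustrative assumptions; the authoritative rules live in the `CoordinatorRuntime` implementation's state enum.

```python
from enum import Enum

class State(Enum):
    INITIAL = "initial"
    LOADING = "loading"
    ACTIVE = "active"
    FAILED = "failed"
    CLOSED = "closed"

# Assumed legal transitions; the real rules live in CoordinatorRuntime's
# internal CoordinatorState enum.
TRANSITIONS = {
    State.INITIAL: {State.LOADING},
    State.LOADING: {State.ACTIVE, State.FAILED, State.CLOSED},
    State.ACTIVE:  {State.FAILED, State.CLOSED},
    State.FAILED:  {State.CLOSED},
    State.CLOSED:  set(),
}

class MiniCoordinatorContext:
    """Toy stand-in for the per-partition CoordinatorContext."""

    def __init__(self, partition):
        self.partition = partition
        self.epoch = -1              # leader epoch tracked alongside state
        self.state = State.INITIAL

    def transition_to(self, target):
        if target not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
```

A partition that loads successfully walks INITIAL → LOADING → ACTIVE; any attempt to skip LOADING is rejected, which mirrors why a freshly assigned partition cannot serve requests before replay completes.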

Step 2: State Loading from Log

The CoordinatorLoader asynchronously reads all committed records from the partition's log and replays them into the shard via the CoordinatorPlayback interface. During loading, the shard transitions from INITIAL to LOADING state. Write operations received during loading are queued in a DeferredEventCollection and replayed after loading completes.

Key considerations:

  • Loading is performed by CoordinatorLoaderImpl which reads MemoryRecords from the partition
  • Records are deserialized using the configured Deserializer and replayed into the shard
  • The loader tracks the last offset and epoch for proper state reconstruction
  • If loading fails, the partition transitions to FAILED state
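The replay loop reduces, in spirit, to applying each committed record in order while tracking the last offset. The sketch below assumes a dict-backed shard and a `(offset, key, value)` record shape for illustration; Kafka's actual records are typed `CoordinatorRecord`s applied through `CoordinatorPlayback`.

```python
def replay_into_shard(committed_records, shard_state):
    """Replay committed (offset, key, value) records into a dict-backed shard,
    as a CoordinatorPlayback implementation would. A value of None is treated
    as a tombstone deleting the key. Returns the last replayed offset."""
    last_offset = -1
    for offset, key, value in committed_records:
        if value is None:
            shard_state.pop(key, None)   # tombstone: remove prior state
        else:
            shard_state[key] = value     # upsert replicated state
        last_offset = offset
    return last_offset
```

Note that a tombstone written later in the log erases earlier state for the same key, which is why full replay from the start of the partition reconstructs exactly the committed state.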

Step 3: Active State and Write Operations

Once loading completes successfully, the partition transitions to ACTIVE state. Write operations are scheduled via scheduleWriteOperation, which creates a CoordinatorWriteEvent processed by the EventAccumulator and MultiThreadedEventProcessor. The write operation can read uncommitted state, produce CoordinatorRecords, and return a response that is parked until the records are committed.

Key considerations:

  • Write events are processed with partition-level serialization via the EventAccumulator
  • The PartitionWriter appends records to the partition log
  • Write results include both records to persist and a response future
  • Responses are delivered only after the produced records are committed (high watermark advances)
  • Failed writes trigger a revert of uncommitted snapshots on the SnapshotRegistry
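The write path's key behaviors, append, read-your-writes, a parked response, and revert on failure, can be modeled in a few lines. This is a simplified sketch: the class and a plain dict copy standing in for a SnapshotRegistry snapshot are assumptions, and the real runtime parks responses in a deferred-event structure rather than a list.

```python
from concurrent.futures import Future

class MiniWritePath:
    """Toy model of scheduleWriteOperation: the operation reads uncommitted
    state, produces records that are appended and applied immediately
    (read-your-writes), and its response future completes only once the high
    watermark covers the appended records."""

    def __init__(self):
        self.state = {}          # uncommitted ("last written") view
        self.log = []            # appended (key, value) records
        self.high_watermark = -1
        self._pending = []       # (last_offset_needed, future, response)

    def schedule_write(self, op):
        fut = Future()
        snapshot = dict(self.state)      # cheap stand-in for a snapshot
        try:
            records, response = op(self.state)
        except Exception as e:
            self.state = snapshot        # revert uncommitted changes
            fut.set_exception(e)
            return fut
        for key, value in records:
            self.log.append((key, value))
            self.state[key] = value      # read-your-writes for later events
        self._pending.append((len(self.log) - 1, fut, response))
        return fut

    def on_high_watermark(self, new_hwm):
        """Release parked responses whose records are now committed."""
        self.high_watermark = new_hwm
        still_pending = []
        for offset, fut, response in self._pending:
            if offset <= new_hwm:
                fut.set_result(response)
            else:
                still_pending.append((offset, fut, response))
        self._pending = still_pending
```

The important property is visible in usage: the future returned by `schedule_write` is not done until `on_high_watermark` advances past the appended record's offset, matching the rule that responses are delivered only after commit.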

Step 4: Read Operations

Read operations are scheduled via scheduleReadOperation and see only committed state. Unlike writes, reads produce no records and return their response immediately, without waiting for log commits. The committed state offset is tracked to keep reads consistent.

Key considerations:

  • Reads use the committed snapshot from the SnapshotRegistry
  • Read operations cannot modify coordinator state
  • Multiple reads can be processed concurrently for the same partition
  • Read results are returned immediately to the caller
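The committed-read guarantee can be illustrated by bounding the visible records at the high watermark. In this sketch the committed view is rebuilt from the log prefix for clarity; in the real runtime that view comes from the SnapshotRegistry's committed snapshot, not a rebuild.

```python
def schedule_read(log, high_watermark, read_op):
    """Toy model of scheduleReadOperation: run read_op against only the
    records at or below the high watermark, so the uncommitted tail of the
    log is never visible to readers."""
    committed = {}
    for key, value in log[: high_watermark + 1]:
        committed[key] = value
    return read_op(committed)
```

With a log of two records for the same key and the high watermark still at offset 0, a read observes the first value; only after the watermark advances does the second become visible.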

Step 5: Timer Management

The CoordinatorTimer allows shards to schedule deferred operations that execute after a specified delay. The CoordinatorTimerImpl manages per-partition timers that are integrated with the event processing pipeline, ensuring timer callbacks are processed with the same serialization guarantees as regular operations.

Key considerations:

  • Timers are used for session timeouts, heartbeat expiration, and periodic maintenance
  • Timer operations are executed within the partition's event processing context
  • Timers can be cancelled or rescheduled by the shard
  • Timer callbacks can produce records (write operations) or just update state
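A per-partition timer with cancel/reschedule semantics can be sketched with a heap plus lazy invalidation. This is a hypothetical simplification: the generation-counter cancellation scheme and the `execute` hook are assumptions, and the real CoordinatorTimerImpl hands fired operations back to the partition's event pipeline rather than running them inline.

```python
import heapq

class MiniTimer:
    """Toy per-partition timer: schedule/cancel keyed operations; firing
    goes through an execute hook, standing in for enqueueing the callback
    as a regular partition event."""

    def __init__(self):
        self._heap = []      # (deadline_ms, key, generation, op)
        self._version = {}   # key -> latest generation; stale entries skipped

    def schedule(self, key, delay_ms, now_ms, op):
        gen = self._version.get(key, 0) + 1   # supersedes earlier schedules
        self._version[key] = gen
        heapq.heappush(self._heap, (now_ms + delay_ms, key, gen, op))

    def cancel(self, key):
        self._version[key] = self._version.get(key, 0) + 1  # invalidate

    def advance(self, now_ms, execute):
        """Fire all due, non-stale timers; returns how many fired."""
        fired = 0
        while self._heap and self._heap[0][0] <= now_ms:
            _deadline, key, gen, op = heapq.heappop(self._heap)
            if self._version.get(key) == gen:   # skip cancelled/rescheduled
                execute(op)
                fired += 1
        return fired
```

Rescheduling a key (e.g. a session timeout refreshed by a heartbeat) silently invalidates the older deadline, which is the behavior a shard relies on when it pushes expirations forward.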

Step 6: Metadata Image Updates

The runtime receives KRaft metadata updates via onNewMetadataImage, which propagates cluster topology changes (broker additions/removals, topic changes) to all active shards. Each shard can react to metadata deltas to update its internal view of the cluster.

Key considerations:

  • Metadata images are provided by KRaftCoordinatorMetadataImage
  • Metadata deltas are computed by KRaftCoordinatorMetadataDelta
  • Shards use metadata to make partition assignment and routing decisions
  • The runtime processes metadata updates in the event pipeline for consistency
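The delta-then-notify pattern can be sketched at the topic level. The function and class names here are illustrative, not Kafka's: a real KRaftCoordinatorMetadataDelta carries much richer information (partition counts, broker changes), and shards react inside the event pipeline.

```python
def topic_delta(old_image, new_image):
    """Compute a minimal topic-level delta between two metadata images,
    analogous in spirit to KRaftCoordinatorMetadataDelta."""
    old_topics, new_topics = set(old_image), set(new_image)
    return {
        "created": new_topics - old_topics,
        "deleted": old_topics - new_topics,
    }

class MiniShard:
    """Shard reacting to an onNewMetadataImage-style callback."""

    def __init__(self):
        self.known_topics = set()

    def on_new_metadata_image(self, new_image, delta):
        # A real shard might expire subscriptions to deleted topics here.
        self.known_topics = set(new_image)
        return delta["deleted"]
```

Passing both the new image and the delta lets a shard update its full view cheaply while still reacting specifically to what changed (e.g. cleaning up state for deleted topics).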

Step 7: Partition Unloading

When the broker loses leadership for a coordinator partition (or during shutdown), the runtime unloads the shard. This transitions the context to CLOSED state, cancels all pending timers, completes all pending events with appropriate errors, closes the shard, and removes the context from the runtime's active partition map.

Key considerations:

  • Unloading can be triggered by leadership loss, partition reassignment, or broker shutdown
  • All pending write operations are completed with a NOT_COORDINATOR error, redirecting clients to the new coordinator
  • The SnapshotRegistry and all associated resources are released
  • The partition can be re-loaded later if leadership is re-acquired
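The unload sequence, fail parked responses, drop timers, close, and remove from the active map, can be sketched as below. The context layout (a dict with `pending`, `timers`, `state`) and the error class are illustrative assumptions; the real runtime performs these steps inside the event pipeline so unloading stays serialized with other partition events.

```python
from concurrent.futures import Future

class NotCoordinatorError(Exception):
    """Stand-in for the NOT_COORDINATOR error handed to parked requests."""

def unload_partition(contexts, partition):
    """Toy unload path: fail parked responses, cancel timers, mark the
    context CLOSED, and remove it from the active-partition map.
    Returns False if the partition was not loaded."""
    ctx = contexts.pop(partition, None)
    if ctx is None:
        return False                     # already unloaded / never loaded
    for fut in ctx["pending"]:
        fut.set_exception(NotCoordinatorError("coordinator moved"))
    ctx["timers"].clear()                # cancel all per-partition timers
    ctx["state"] = "CLOSED"
    return True
```

Because the context is removed before anything else can observe it, a later re-election simply starts a fresh lifecycle at Step 1 rather than resurrecting stale state.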
