Principle:Apache Beam Windmill Connection Setup


Principle Name: Windmill Connection Setup
Domain: Distributed_Systems, Streaming_Processing, gRPC
Overview: Process of establishing bidirectional gRPC streaming connections to the Windmill backend for work distribution, data retrieval, and result commitment.
Related Implementation: Implementation:Apache_Beam_FanOutStreamingEngineWorkerHarness
Repository: apache/beam
last_updated: 2026-02-09 04:00 GMT

Overview

Windmill connection setup is the process of establishing bidirectional gRPC streaming connections to the Windmill backend for work distribution, data retrieval, and result commitment. Windmill is the coordination and state management service at the heart of Dataflow Streaming, and every streaming worker must maintain active connections to it in order to receive and process work.

Description

Windmill serves as the shuffle and state service in Dataflow Streaming. It is responsible for:

  • Work distribution: Assigning work items (keyed elements and timer firings) to workers.
  • State management: Storing and serving per-key persistent state accessed during work item processing.
  • Commit coordination: Accepting processed work results and advancing watermarks.

Connection setup involves creating and managing several types of gRPC streams:

1. GetWork Streams: Bidirectional gRPC streams through which the worker requests work items from Windmill. The worker sends a GetWorkRequest with a budget (maximum items and bytes) and receives a stream of GetWorkResponse messages containing work items. The budget controls how much outstanding work a single worker can hold.

2. GetData Streams: Streams for reading persistent state from Windmill. When a DoFn accesses state during processing, the request is routed through a GetData stream to the appropriate Windmill backend. These streams support both keyed state lookups and side input reads.

3. CommitWork Streams: Streams for sending processed results back to Windmill. After a work item is processed, the state mutations, output elements, and timer updates are bundled into a WorkItemCommitRequest and sent through a commit stream.

4. GetWorkerMetadata Stream: A special stream used in the fan-out architecture to receive endpoint assignments from the Windmill dispatcher. This stream informs the worker which Windmill backends to connect to and triggers connection reconfiguration when backends change.
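
The GetWork budget described in item 1 can be illustrated with a small model. This is a minimal sketch in Python, not the Beam worker's Java code: WorkItem and serve_within_budget are hypothetical names, and the rule of stopping at the first item that does not fit is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical stand-in for a work item; only its size matters here."""
    key: str
    size_bytes: int


def serve_within_budget(
    pending: Iterable[WorkItem], max_items: int, max_bytes: int
) -> Iterator[WorkItem]:
    """Yield work items until the next one would exceed either budget limit."""
    items_left, bytes_left = max_items, max_bytes
    for item in pending:
        if items_left < 1 or item.size_bytes > bytes_left:
            break  # assumption: stop at the first item that does not fit
        items_left -= 1
        bytes_left -= item.size_bytes
        yield item


pending = [WorkItem("a", 40), WorkItem("b", 50), WorkItem("c", 30)]
delivered = list(serve_within_budget(pending, max_items=10, max_bytes=100))
print([w.key for w in delivered])  # prints ['a', 'b']
```

Item "c" is held back because only 10 bytes of budget remain after "a" and "b", which is how the budget bounds the outstanding work a single worker holds.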

The fan-out harness (FanOutStreamingEngineWorkerHarness) manages connections to multiple Windmill backends simultaneously. Key aspects of the fan-out architecture include:

  • Budget Distribution: The total GetWork budget (items + bytes) is divided across all active Windmill backends using a GetWorkBudgetDistributor. The default strategy distributes the budget evenly.
  • Dynamic Endpoint Management: When the GetWorkerMetadata stream delivers new endpoint assignments, the harness closes streams to backends that are no longer active and opens streams to newly assigned backends.
  • Per-Backend Streams: Each Windmill backend endpoint gets its own set of GetWork, GetData, and CommitWork streams, managed by a WindmillStreamSender.
  • Channel Caching: gRPC channels to Windmill backends are cached by the ChannelCachingStubFactory to avoid repeated channel creation overhead.
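
An even split of the budget across backends, as in the Budget Distribution bullet above, might look like the following. This is a sketch under stated assumptions: Budget and distribute_evenly are hypothetical Python names, not the GetWorkBudgetDistributor interface, and handing the division remainder to the first backends is an illustrative choice rather than the documented rounding behavior.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Budget:
    """Hypothetical (items, bytes) budget pair."""
    max_items: int
    max_bytes: int


def distribute_evenly(total: Budget, backends: List[str]) -> Dict[str, Budget]:
    """Split the total budget across backends so the shares sum to the total."""
    if not backends:
        return {}
    n = len(backends)
    item_share, item_extra = divmod(total.max_items, n)
    byte_share, byte_extra = divmod(total.max_bytes, n)
    return {
        backend: Budget(
            # assumption: the first `extra` backends absorb the remainder
            max_items=item_share + (1 if i < item_extra else 0),
            max_bytes=byte_share + (1 if i < byte_extra else 0),
        )
        for i, backend in enumerate(backends)
    }


total = Budget(max_items=100, max_bytes=64 * 1024 * 1024)
shares = distribute_evenly(total, ["backend-1", "backend-2", "backend-3"])
for backend, share in shares.items():
    print(backend, share.max_items, share.max_bytes)
```

Keeping the sum of the shares equal to the total means the worker never advertises more capacity than it was configured with, regardless of how many backends are active.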

Usage

Windmill connection setup is required for every Dataflow streaming pipeline and runs automatically during worker startup. Understanding the process helps with:

  • Diagnosing connectivity issues: If a worker cannot connect to Windmill, no work will be processed. Common issues include network configuration problems, authentication failures, or endpoint resolution errors.
  • Understanding work distribution: The GetWork budget directly affects how much work a single worker can hold. The item count of the budget is configured via maxBundlesOutstanding, and the byte count is capped at 64 MB.
  • Tuning performance: The number of commit threads (windmillServiceCommitThreads) and GetData stream count (windmillGetDataStreamCount) affect throughput. More streams can improve parallelism but consume more resources.
  • Understanding failover: When Windmill backends change (due to scaling or failures), the fan-out harness handles the transition by closing old streams and opening new ones, guided by metadata version numbers to ensure monotonic progress.
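
The failover behavior in the last bullet can be sketched as a reconciliation step that ignores stale metadata and computes which streams to close and open. This is a hypothetical Python model, not the harness's actual code: EndpointState, apply, and the endpoint strings are illustrative names, and treating any version that is not strictly newer as stale is an assumption.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class EndpointState:
    """Tracks the last applied metadata version and the active endpoints."""
    version: int = -1
    active: Set[str] = field(default_factory=set)

    def apply(self, version: int, endpoints: Iterable[str]) -> Tuple[Set[str], Set[str]]:
        """Return (to_close, to_open); stale updates change nothing."""
        if version <= self.version:
            return set(), set()  # assumption: only strictly newer versions apply
        new_endpoints = set(endpoints)
        to_close = self.active - new_endpoints
        to_open = new_endpoints - self.active
        self.version = version
        self.active = new_endpoints
        return to_close, to_open


state = EndpointState()
first = state.apply(1, ["backend-a", "backend-b"])
second = state.apply(3, ["backend-b", "backend-c"])
stale = state.apply(2, ["backend-a"])  # older than version 3, so ignored
print(first, second, stale)
```

Because the version check comes first, a delayed metadata message cannot reopen streams to a backend that a newer assignment already dropped.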

Theoretical Basis

Windmill connection setup is based on the streaming shuffle architecture, in which a persistent state service coordinates work distribution across workers. The key theoretical foundations include:

  • Bidirectional gRPC Streaming: Unlike request-response RPC, bidirectional streaming maintains persistent connections that allow both sides to send messages independently. This provides low-latency communication essential for streaming workloads where new work items arrive continuously.
  • Budget-Based Flow Control: The GetWork budget implements a credit-based flow control mechanism. Workers advertise how much work they can accept, and Windmill respects these limits. This prevents workers from being overwhelmed and provides natural backpressure.
  • Consistent Hashing and Key Ranges: Windmill assigns work to backends based on key ranges. The fan-out architecture ensures workers connect to all backends that may hold keys relevant to their assigned computations.
  • Lease-Based Work Assignment: Work items are leased to workers via GetWork streams. If a worker fails to commit or heartbeat a work item within the lease duration, Windmill reassigns it to another worker, providing fault tolerance.
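
Lease-based assignment, the last item above, can be modeled with a table of expiry times. The sketch below is illustrative only: LeaseTable and its methods are hypothetical Python names, the lease duration is arbitrary, and time is passed in explicitly so the behavior is deterministic. It does not reflect how Windmill tracks leases internally.

```python
from typing import Dict, List, Tuple


class LeaseTable:
    """Hypothetical lease bookkeeping: work id -> (worker, expiry time)."""

    def __init__(self, lease_seconds: float) -> None:
        self._lease_seconds = lease_seconds
        self._leases: Dict[str, Tuple[str, float]] = {}

    def assign(self, work_id: str, worker: str, now: float) -> None:
        """Lease a work item to a worker, replacing any previous lease."""
        self._leases[work_id] = (worker, now + self._lease_seconds)

    def _holds_valid_lease(self, work_id: str, worker: str, now: float) -> bool:
        lease = self._leases.get(work_id)
        return lease is not None and lease[0] == worker and lease[1] > now

    def heartbeat(self, work_id: str, worker: str, now: float) -> bool:
        """Extend the lease if the worker still holds it."""
        if not self._holds_valid_lease(work_id, worker, now):
            return False
        self._leases[work_id] = (worker, now + self._lease_seconds)
        return True

    def commit(self, work_id: str, worker: str, now: float) -> bool:
        """Accept a commit only from the current lease holder."""
        if not self._holds_valid_lease(work_id, worker, now):
            return False
        del self._leases[work_id]
        return True

    def expired(self, now: float) -> List[str]:
        """Work ids whose lease has lapsed and can be reassigned."""
        return sorted(w for w, (_, expiry) in self._leases.items() if expiry <= now)


table = LeaseTable(lease_seconds=10)
table.assign("work-1", "worker-A", now=0)
table.heartbeat("work-1", "worker-A", now=5)   # lease now runs until t=15
print(table.expired(now=12))                   # prints []
print(table.expired(now=15))                   # prints ['work-1']
table.assign("work-1", "worker-B", now=15)     # reassigned after expiry
print(table.commit("work-1", "worker-A", now=16))  # prints False
print(table.commit("work-1", "worker-B", now=16))  # prints True
```

Rejecting the late commit from worker-A is what keeps a reassigned work item from being committed twice.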
