
Principle:Apache Beam Computation Configuration

From Leeroopedia


Principle Name: Computation Configuration
Domain: Distributed_Systems, Configuration_Management
Overview: Mechanism for dynamically fetching and caching computation configurations that map pipeline stages to their execution specifications.
Related Implementation: Implementation:Apache_Beam_StreamingEngineComputationConfigFetcher
Repository: apache/beam
Last Updated: 2026-02-09 04:00 GMT

Overview

Computation configuration is the mechanism by which a Dataflow streaming worker dynamically fetches and caches the configurations that map pipeline stages to their execution specifications. Each computation represents a unit of pipeline execution, and its configuration defines the transform graph, state families, and system name that the worker needs in order to execute work items for that computation.

Description

In Dataflow Streaming, computations are the fundamental unit of pipeline execution. A pipeline is decomposed into multiple computations, each corresponding to a fused group of transforms. The computation configuration system is responsible for:

1. Configuration Structure: Each ComputationConfig contains:

  • MapTask: The specification of the transform graph for this computation, including system name, stage name, and the list of instructions (ParDo operations, GroupByKey, etc.).
  • Transform-to-StateFamily Mapping: A map from user-defined transform names to the state family names used by Windmill for state storage.
  • User Step-to-StateFamily Mapping: A secondary mapping for user steps to state families, used for backward compatibility.
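The three parts described above can be sketched as plain data types. This is an illustrative model only; the actual Beam classes are richer (the real MapTask, for instance, is a Dataflow service object with full instruction definitions), and all names and values here are assumptions:

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the three parts of a ComputationConfig.
// Field names and types are stand-ins, not the actual Beam API.
public class ComputationConfigSketch {
    // Stand-in for the MapTask: system name, stage name, and instruction list.
    record MapTask(String systemName, String stageName, List<String> instructions) {}

    // The config pairs the transform graph with the two state-family mappings.
    record ComputationConfig(
            MapTask mapTask,
            Map<String, String> transformUserNameToStateFamily,
            Map<String, String> userStepToStateFamily) {}

    // Example with made-up values for a single fused stage.
    static ComputationConfig example() {
        return new ComputationConfig(
                new MapTask("S01", "F01", List.of("ParDo", "GroupByKey")),
                Map.of("CountWords", "sf-1"),
                Map.of("s1", "sf-1"));
    }
}
```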

2. Configuration Fetching: The StreamingEngineComputationConfigFetcher fetches configuration from the Dataflow service in two modes:

  • Initial Global Config: At startup, the fetcher blocks until it receives the initial global pipeline configuration from the Dataflow service. This configuration includes Windmill service endpoints, operational limits (max commit bytes, max key bytes, max output bytes), and tag encoding version.
  • Periodic Refresh: After startup, a background thread periodically re-fetches the global configuration to pick up changes such as new Windmill endpoints or updated operational limits.
  • On-Demand Computation Config: When the worker encounters a work item for an unknown computation ID, it fetches that specific computation's configuration from the Dataflow service.
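The three fetch modes can be sketched as follows. This is a minimal model, not the real StreamingEngineComputationConfigFetcher: the injected Supplier and Function stand in for RPCs to the Dataflow service, and all names are illustrative:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of the three fetch modes; service RPCs are injected.
public class ConfigFetcherSketch {
    private volatile String globalConfig;  // global pipeline config, refreshed periodically
    private final Map<String, String> perComputation = new ConcurrentHashMap<>();
    private final Supplier<String> fetchGlobal;
    private final Function<String, Optional<String>> fetchOne;

    ConfigFetcherSketch(Supplier<String> fetchGlobal,
                        Function<String, Optional<String>> fetchOne) {
        this.fetchGlobal = fetchGlobal;
        this.fetchOne = fetchOne;
    }

    // Modes 1 and 2: block for the initial global config, then refresh on a timer.
    void start(ScheduledExecutorService scheduler, long refreshPeriodMillis) {
        globalConfig = fetchGlobal.get();  // startup blocks until this returns
        scheduler.scheduleWithFixedDelay(
                () -> globalConfig = fetchGlobal.get(),
                refreshPeriodMillis, refreshPeriodMillis, TimeUnit.MILLISECONDS);
    }

    // Mode 3: on-demand fetch when a work item references an unknown computation ID.
    Optional<String> getComputationConfig(String computationId) {
        String cached = perComputation.get(computationId);
        if (cached != null) return Optional.of(cached);
        Optional<String> fetched = fetchOne.apply(computationId);
        fetched.ifPresent(cfg -> perComputation.put(computationId, cfg));
        return fetched;
    }

    String globalConfig() { return globalConfig; }
}
```

Note how the on-demand path returns empty rather than throwing when the service has no config for the ID, mirroring the failure mode discussed under Usage below.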

3. Configuration Caching: The ComputationStateCache stores active computation states. When a new configuration is fetched, it creates a ComputationState object that holds the MapTask, state families, and an active work map for tracking in-progress work items.
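The caching step can be sketched with a `computeIfAbsent` idiom: the first time a computation ID is seen, a state object is built from the fetched config and reused for all later work items. The real ComputationStateCache holds richer state; the names and field types here are illustrative assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a computation state cache.
public class ComputationStateCacheSketch {
    // Stand-in for ComputationState: the config plus a map of in-progress
    // work items (here keyed by work token).
    record ComputationState(String computationId, String mapTask,
                            Map<String, String> stateFamilies,
                            Map<Long, String> activeWork) {}

    private final Map<String, ComputationState> cache = new ConcurrentHashMap<>();

    // Atomically creates the state on first access and returns the cached
    // instance on every subsequent access.
    ComputationState getOrCreate(String computationId, String mapTask,
                                 Map<String, String> stateFamilies) {
        return cache.computeIfAbsent(computationId,
                id -> new ComputationState(id, mapTask, stateFamilies,
                        new ConcurrentHashMap<>()));
    }
}
```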

4. DoFn Dispatching: The DefaultParDoFnFactory uses the computation configuration to instantiate the correct DoFn implementation. It maintains a dispatch table mapping class names to specialized factories:

  • "DoFn" dispatches to UserParDoFnFactory for user-defined transforms.
  • "CombineValuesFn" dispatches to CombineValuesFnFactory.
  • "MergeBucketsDoFn" and "MergeWindowsDoFn" dispatch to GroupAlsoByWindowParDoFnFactory.
  • "AssignBucketsDoFn" and "AssignWindowsDoFn" dispatch to AssignWindowsParDoFnFactory.
  • "SplittableProcessFn" dispatches to SplittableProcessFnFactory.
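The dispatch table above amounts to a map from class-name string to factory. The sketch below mirrors that mapping; the factory interface and its return values are placeholders, not the actual Beam types:

```java
import java.util.Map;

// Hypothetical sketch of the DefaultParDoFnFactory dispatch table.
public class ParDoDispatchSketch {
    // Placeholder for the specialized factory interface.
    interface ParDoFnFactory { String create(); }

    // Class names from the computation config mapped to factories,
    // following the list above.
    static final Map<String, ParDoFnFactory> DISPATCH = Map.of(
            "DoFn", () -> "UserParDoFnFactory",
            "CombineValuesFn", () -> "CombineValuesFnFactory",
            "MergeBucketsDoFn", () -> "GroupAlsoByWindowParDoFnFactory",
            "MergeWindowsDoFn", () -> "GroupAlsoByWindowParDoFnFactory",
            "AssignBucketsDoFn", () -> "AssignWindowsParDoFnFactory",
            "AssignWindowsDoFn", () -> "AssignWindowsParDoFnFactory",
            "SplittableProcessFn", () -> "SplittableProcessFnFactory");

    static String dispatch(String className) {
        ParDoFnFactory factory = DISPATCH.get(className);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown ParDoFn class: " + className);
        }
        return factory.create();
    }
}
```

A lookup miss surfaces immediately as an exception, which is why an unrecognized class name in a MapTask instruction shows up as an instantiation failure rather than silent misbehavior.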

5. Retry with Backoff: Configuration fetching uses exponential backoff with jitter (initial backoff of 100ms, max backoff of 1 minute, max cumulative backoff of 5 minutes) to handle transient failures when communicating with the Dataflow service.
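The retry policy can be sketched as a loop that doubles a backoff window, adds jitter, and gives up once the cumulative sleep budget is spent. The constants match the parameters quoted above; the loop structure and jitter formula are illustrative assumptions, not the actual Beam implementation:

```java
import java.util.Optional;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical sketch of exponential backoff with jitter for config fetches.
public class BackoffSketch {
    static final long INITIAL_BACKOFF_MS = 100;       // initial backoff: 100 ms
    static final long MAX_BACKOFF_MS = 60_000;        // max backoff: 1 minute
    static final long MAX_CUMULATIVE_MS = 300_000;    // max cumulative backoff: 5 minutes

    // Retries the attempt until it yields a value or the cumulative
    // sleep budget is exhausted.
    static <T> Optional<T> fetchWithBackoff(Supplier<Optional<T>> attempt, Random random) {
        long backoff = INITIAL_BACKOFF_MS;
        long slept = 0;
        while (true) {
            Optional<T> result = attempt.get();
            if (result.isPresent()) {
                return result;
            }
            if (slept >= MAX_CUMULATIVE_MS) {
                return Optional.empty();  // budget spent; give up
            }
            // Jitter: sleep between 0.5x and 1.5x of the current window.
            long sleepMs = Math.min((long) (backoff * (0.5 + random.nextDouble())),
                    MAX_BACKOFF_MS);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
            slept += sleepMs;
            backoff = Math.min(backoff * 2, MAX_BACKOFF_MS);  // exponential growth, capped
        }
    }
}
```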

Usage

Computation configuration is handled automatically by the Dataflow streaming worker; no user action is required. Understanding it nonetheless helps with:

  • Diagnosing stage mapping failures: If a work item arrives for an unknown computation ID, the worker must fetch its configuration. If the fetch fails, the work item cannot be processed.
  • Understanding configuration staleness: The periodic refresh interval (globalConfigRefreshPeriodMillis) controls how quickly workers pick up configuration changes. A longer interval means slower response to endpoint changes.
  • Debugging DoFn instantiation: If a DoFn fails to instantiate, the DefaultParDoFnFactory dispatch table and the UserParDoFnFactory deserialization path are the first places to investigate.
  • Operational limits: The global configuration includes operational limits such as maxWorkItemCommitBytes, maxOutputKeyBytes, and maxOutputValueBytes. These limits are enforced during work item processing and can cause failures if exceeded.
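As a sketch of how the operational limits might be checked against a work item commit, the field names below follow the limits quoted above, but the check itself is illustrative rather than the actual enforcement code:

```java
// Hypothetical sketch of enforcing the operational limits from the
// global configuration against a single work item commit.
public class CommitLimitSketch {
    record OperationalLimits(long maxWorkItemCommitBytes,
                             long maxOutputKeyBytes,
                             long maxOutputValueBytes) {}

    // Returns true if the commit fits inside every configured limit.
    static boolean withinLimits(OperationalLimits limits,
                                long commitBytes, long keyBytes, long valueBytes) {
        return commitBytes <= limits.maxWorkItemCommitBytes()
                && keyBytes <= limits.maxOutputKeyBytes()
                && valueBytes <= limits.maxOutputValueBytes();
    }
}
```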

Theoretical Basis

Computation configuration is based on the separation of control plane and data plane in distributed systems:

  • Control Plane: The Dataflow service acts as the control plane, maintaining the authoritative configuration for all computations in a pipeline. It dynamically assigns work specifications to workers and can update configurations (e.g., changing Windmill endpoints) without restarting workers.
  • Data Plane: The worker acts as the data plane, executing work items according to the configurations received from the control plane. It caches configurations locally for performance but refreshes them periodically to stay synchronized.

This separation provides several benefits:

  • Dynamic Reconfiguration: Workers can pick up new computations or endpoint changes without restart.
  • Scalability: The control plane can serve configuration to many workers concurrently.
  • Fault Isolation: A transient control plane failure does not immediately halt data plane processing, as workers continue with cached configurations.

The dispatch pattern used by DefaultParDoFnFactory follows the strategy pattern, allowing different execution strategies for different transform types while maintaining a uniform interface.

Related Pages

  • Implementation:Apache_Beam_StreamingEngineComputationConfigFetcher
