Workflow: Apache Beam Local Pipeline Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Execution, Testing |
| Last Updated | 2026-02-09 04:30 GMT |
Overview
End-to-end process for executing an Apache Beam pipeline locally using the Direct Runner for development, testing, and pipeline correctness validation.
Description
This workflow covers the complete lifecycle of running a Beam pipeline on a single JVM using the Direct Runner. The Direct Runner is the default runner for local development and testing. It applies transform overrides to decompose composite transforms, validates display data and model invariants, builds an internal execution graph, and processes elements in parallel using a thread pool. The runner enforces strict Beam model semantics including element immutability and coder correctness, making it ideal for catching pipeline bugs before deploying to a distributed runner.
Usage
Execute this workflow when you need to develop, debug, or validate a Beam pipeline locally before deploying it to a distributed runner such as Dataflow or Flink. This is the standard development loop for Beam pipeline authors: construct the pipeline in code, run it with the Direct Runner, inspect the results, and iterate.
Execution Steps
Step 1: Pipeline Construction
Define the pipeline using the Beam SDK. Create a Pipeline object with PipelineOptions (defaulting to DirectRunner), then chain transforms such as Read, ParDo, GroupByKey, Combine, and Write to build the processing DAG. Each transform produces PCollections that serve as inputs to subsequent transforms.
Key considerations:
- Set the runner to DirectRunner (or leave as default)
- Configure DirectOptions for parallelism and model enforcement settings
- Ensure all user DoFns, coders, and window functions are serializable
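The construction phase can be sketched with a minimal, dependency-free analogue (illustrative only; the `Pipeline`, `PCollection`, and `Transform` classes below are toy stand-ins, not the Beam SDK's API): applying a transform records a node in the DAG, and nothing executes until `run()` is called.

```python
class PCollection:
    """Handle to the output of a transform; carries no data at construction time."""
    def __init__(self, pipeline, producer):
        self.pipeline = pipeline
        self.producer = producer  # the transform that produces this collection

class Transform:
    def __init__(self, name, fn, inputs):
        self.name, self.fn, self.inputs = name, fn, inputs

class Pipeline:
    def __init__(self):
        self.transforms = []  # recorded in construction (topological) order

    def create(self, name, elements):
        # Root transform: no inputs, materializes a fixed element list.
        t = Transform(name, lambda _: list(elements), inputs=[])
        self.transforms.append(t)
        return PCollection(self, t)

    def map(self, pcoll, name, fn):
        # Element-wise transform chained onto an upstream PCollection.
        t = Transform(name, lambda xs: [fn(x) for x in xs], inputs=[pcoll])
        self.transforms.append(t)
        return PCollection(self, t)

    def run(self):
        # Deferred execution: evaluate transforms in construction order.
        results = {}
        for t in self.transforms:
            inp = results[t.inputs[0].producer] if t.inputs else None
            results[t] = t.fn(inp)
        return results

p = Pipeline()
nums = p.create("Read", [1, 2, 3])
doubled = p.map(nums, "ParDo(Double)", lambda x: x * 2)
out = p.run()[doubled.producer]
print(out)  # [2, 4, 6]
```

The key property mirrored here is deferred execution: chaining transforms only builds the DAG, and evaluation happens in a separate run step.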
Step 2: Transform Override Application
When the pipeline is submitted, the Direct Runner applies a series of transform overrides that decompose high-level transforms into lower-level primitives it can execute directly. GroupByKey is split into GroupByKeyOnly and GroupAlsoByWindow stages, SplittableDoFn is decomposed into restriction tracking and element processing, and WriteFiles is overridden to add sharding logic.
Key considerations:
- Overrides ensure that the Direct Runner can handle all Beam model primitives
- ParDoMultiOverrideFactory handles SplittableDoFn decomposition
- WriteWithShardingFactory adds runner-determined sharding for file writes
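The GroupByKey split described above can be sketched as two plain functions (a simplification: windows are identified by an integer and there are no triggers or watermarks):

```python
from collections import defaultdict

def group_by_key_only(kv_pairs):
    # Stage 1 (GroupByKeyOnly): shuffle values by key, ignoring windows.
    # Each element is ((key, window), value).
    grouped = defaultdict(list)
    for (key, window), value in kv_pairs:
        grouped[key].append((window, value))
    return grouped

def group_also_by_window(grouped):
    # Stage 2 (GroupAlsoByWindow): within each key, regroup values by window.
    out = {}
    for key, windowed_values in grouped.items():
        per_window = defaultdict(list)
        for window, value in windowed_values:
            per_window[window].append(value)
        out[key] = dict(per_window)
    return out

pairs = [(("a", 0), 1), (("a", 0), 2), (("a", 1), 3), (("b", 0), 4)]
result = group_also_by_window(group_by_key_only(pairs))
print(result)  # {'a': {0: [1, 2], 1: [3]}, 'b': {0: [4]}}
```

Splitting the shuffle from the windowed regrouping lets the runner place the two stages independently, which is why the override exists.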
Step 3: Pipeline Graph Construction
The pipeline is traversed by a DirectGraphVisitor to build a DirectGraph, which is an internal representation of the execution DAG. The visitor collects all transforms and their input/output PCollections, identifies root transforms (those without inputs, such as Read and Impulse), and maps each PCollection to its producing transform.
Key considerations:
- The DirectGraph is an immutable snapshot used throughout execution
- Root transforms are the starting points for the execution engine
- KeyedPValueTrackingVisitor identifies which PCollections carry keyed data
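The visitor's two core outputs, root transforms and the PCollection-to-producer map, can be sketched as a single traversal (the tuple representation below is illustrative, not the DirectGraph API):

```python
def build_graph(transforms):
    """transforms: list of (name, input_pcollections, output_pcollections)."""
    producers = {}  # PCollection name -> producing transform
    roots = []      # transforms with no inputs (e.g. Read, Impulse)
    for name, inputs, outputs in transforms:
        for out in outputs:
            producers[out] = name
        if not inputs:
            roots.append(name)
    return roots, producers

transforms = [
    ("Read", [], ["pc1"]),
    ("ParDo", ["pc1"], ["pc2"]),
    ("GroupByKey", ["pc2"], ["pc3"]),
]
roots, producers = build_graph(transforms)
print(roots)      # ['Read']
print(producers)  # {'pc1': 'Read', 'pc2': 'ParDo', 'pc3': 'GroupByKey'}
```

With this mapping in hand, the executor can start at the roots and look up downstream consumers as each PCollection is produced.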
Step 4: Model Enforcement Configuration
The runner configures enforcement checks that validate Beam model invariants at runtime. Immutability enforcement ensures that elements are not mutated after being output. Encodability enforcement validates that all elements can be round-tripped through their coders. These checks can be enabled or disabled via DirectOptions.
Key considerations:
- ImmutabilityEnforcementFactory checks that input elements are not modified during processing
- ImmutabilityCheckingBundleFactory verifies output elements are not mutated after emission
- CloningBundleFactory can be used to defensively copy elements
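The immutability check can be sketched as follows (a simplified analogue: `pickle` stands in for the element's coder, and `process_with_enforcement` is a hypothetical helper, not a Beam class). The idea is to snapshot each input's encoded form before handing it to user code and verify it still encodes identically afterwards:

```python
import pickle

def process_with_enforcement(elements, do_fn):
    outputs = []
    for elem in elements:
        before = pickle.dumps(elem)  # stand-in for coder-encoded bytes
        outputs.extend(do_fn(elem))
        if pickle.dumps(elem) != before:
            raise RuntimeError(f"DoFn mutated its input element: {elem!r}")
    return outputs

# A well-behaved DoFn copies instead of mutating its input:
good = lambda d: [{**d, "seen": True}]
print(process_with_enforcement([{"k": 1}], good))  # [{'k': 1, 'seen': True}]

# A buggy DoFn that mutates its input in place is caught:
def bad(d):
    d["seen"] = True
    return [d]

try:
    process_with_enforcement([{"k": 1}], bad)
except RuntimeError as e:
    print(e)
```

This is exactly the class of bug that passes silently on a distributed runner (where elements are serialized between workers anyway) but corrupts results locally, which is why the Direct Runner enforces it by default.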
Step 5: Parallel Execution
The ExecutorServiceParallelExecutor drives the pipeline to completion using a thread pool. It creates TransformExecutor tasks for each pending bundle of work, maintains per-key serialization for stateful transforms, and uses a QuiescenceDriver to schedule new work as watermarks advance and timers fire. Execution continues until all transforms reach quiescence (no pending work remains).
Key considerations:
- Stateless transforms execute in parallel across all available threads
- Stateful transforms are serialized per StepAndKey to ensure consistency
- The WatermarkManager tracks per-transform watermarks to determine when windows close
- The EvaluationContext coordinates bundles, state, side inputs, and metrics
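The per-key serialization rule can be sketched with a thread pool and one lock per (step, key) pair, mirroring per-StepAndKey serialization (the lock-table approach below is an illustrative simplification of what the executor guarantees):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

state = {}          # per-key state cell, as a stateful transform would hold
locks = {}          # one lock per (step, key): work for the same pair is serialized
locks_guard = threading.Lock()

def lock_for(step_and_key):
    with locks_guard:
        return locks.setdefault(step_and_key, threading.Lock())

def stateful_add(step, key, value):
    # Serialized per (step, key); different keys still run in parallel.
    with lock_for((step, key)):
        state[key] = state.get(key, 0) + value

with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(100):
        pool.submit(stateful_add, "Sum", "a" if i % 2 else "b", 1)

print(state["a"], state["b"])  # 50 50
```

Without the per-key lock, the read-modify-write on `state[key]` would race and lose increments; with it, stateless parallelism across keys is preserved while state for any single key stays consistent.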
Step 6: Result Collection
Once all transforms have reached quiescence, the Direct Runner returns a DirectPipelineResult. This result object provides access to pipeline metrics (counters, distributions, gauges) and the final pipeline state (DONE, FAILED, or CANCELLED), and supports blocking until completion via waitUntilFinish. If any transform failed, the exception is propagated to the caller.
Key considerations:
- DirectMetrics aggregates metrics from all completed bundles
- The result can be configured to block until completion or return immediately
- Failed transforms produce UserCodeException with the original stack trace
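The result-collection contract can be sketched as a small class (illustrative names, not the DirectPipelineResult API): per-bundle counters are merged into pipeline-level metrics, and waiting on the result either returns the terminal state or re-raises the propagated failure.

```python
from collections import Counter

class PipelineResult:
    def __init__(self):
        self.metrics = Counter()   # aggregated across all completed bundles
        self.state = "RUNNING"
        self.error = None

    def commit_bundle(self, counters):
        self.metrics.update(counters)  # merge this bundle's counters

    def finish(self, error=None):
        self.state = "FAILED" if error else "DONE"
        self.error = error

    def wait_until_finish(self):
        if self.error:
            raise self.error           # propagate the user-code failure
        return self.state

result = PipelineResult()
result.commit_bundle({"elements_read": 3})
result.commit_bundle({"elements_read": 2})
result.finish()
print(result.wait_until_finish())        # DONE
print(result.metrics["elements_read"])   # 5
```

Aggregating at commit time, rather than at query time, is what lets the caller read consistent totals the moment the pipeline reaches a terminal state.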