Workflow: Apache Beam Local Pipeline Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Execution, Testing |
| Last Updated | 2026-02-09 04:30 GMT |
Overview
End-to-end process for executing an Apache Beam pipeline locally using the Direct Runner for development, testing, and pipeline correctness validation.
Description
This workflow covers the complete lifecycle of running a Beam pipeline on a single JVM using the Direct Runner. The Direct Runner is the default runner for local development and testing. It applies transform overrides to decompose composite transforms, validates display data and model invariants, builds an internal execution graph, and processes elements in parallel using a thread pool. The runner enforces strict Beam model semantics including element immutability and coder correctness, making it ideal for catching pipeline bugs before deploying to a distributed runner.
Usage
Execute this workflow when you need to develop, debug, or validate a Beam pipeline locally before deploying it to a distributed runner such as Dataflow or Flink. This is the standard development loop for Beam pipeline authors: construct the pipeline in code, run it with the Direct Runner, inspect the results, and iterate.
Execution Steps
Step 1: Pipeline Construction
Define the pipeline using the Beam SDK. Create a Pipeline object with PipelineOptions (defaulting to DirectRunner), then chain transforms such as Read, ParDo, GroupByKey, Combine, and Write to build the processing DAG. Each transform produces PCollections that serve as inputs to subsequent transforms.
Key considerations:
- Set the runner to DirectRunner (or leave as default)
- Configure DirectOptions for parallelism and model enforcement settings
- Ensure all user DoFns, coders, and window functions are serializable
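The construction phase can be sketched with a minimal, dependency-free analogue (illustrative only; the `Pipeline`, `PCollection`, and `Transform` classes below are toy stand-ins, not the Beam SDK's API): applying a transform records a node in the DAG, and nothing executes until `run()` is called.

```python
class PCollection:
    """Handle to the output of a transform; carries no data at construction time."""
    def __init__(self, pipeline, producer):
        self.pipeline = pipeline
        self.producer = producer  # the transform that produces this collection

class Transform:
    def __init__(self, name, fn, inputs):
        self.name, self.fn, self.inputs = name, fn, inputs

class Pipeline:
    def __init__(self):
        self.transforms = []  # recorded in construction (topological) order

    def create(self, name, elements):
        # Root transform: no inputs, materializes a fixed element list.
        t = Transform(name, lambda _: list(elements), inputs=[])
        self.transforms.append(t)
        return PCollection(self, t)

    def map(self, pcoll, name, fn):
        # Element-wise transform chained onto an upstream PCollection.
        t = Transform(name, lambda xs: [fn(x) for x in xs], inputs=[pcoll])
        self.transforms.append(t)
        return PCollection(self, t)

    def run(self):
        # Deferred execution: evaluate transforms in construction order.
        results = {}
        for t in self.transforms:
            inp = results[t.inputs[0].producer] if t.inputs else None
            results[t] = t.fn(inp)
        return results

p = Pipeline()
nums = p.create("Read", [1, 2, 3])
doubled = p.map(nums, "ParDo(Double)", lambda x: x * 2)
out = p.run()[doubled.producer]
print(out)  # [2, 4, 6]
```

The key property mirrored here is deferred execution: chaining transforms only builds the DAG, and evaluation happens in a separate run step.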
Step 2: Transform Override Application
When the pipeline is submitted, the Direct Runner applies a series of transform overrides that decompose high-level transforms into lower-level primitives it can execute directly. GroupByKey is split into GroupByKeyOnly and GroupAlsoByWindow stages, SplittableDoFn is decomposed into restriction tracking and element processing, and WriteFiles is overridden to add sharding logic.
Key considerations:
- Overrides ensure that the Direct Runner can handle all Beam model primitives
- ParDoMultiOverrideFactory handles SplittableDoFn decomposition
- WriteWithShardingFactory adds runner-determined sharding for file writes
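The GroupByKey split described above can be sketched as two plain functions (a simplification: windows are identified by an integer and there are no triggers or watermarks):

```python
from collections import defaultdict

def group_by_key_only(kv_pairs):
    # Stage 1 (GroupByKeyOnly): shuffle values by key, ignoring windows.
    # Each element is ((key, window), value).
    grouped = defaultdict(list)
    for (key, window), value in kv_pairs:
        grouped[key].append((window, value))
    return grouped

def group_also_by_window(grouped):
    # Stage 2 (GroupAlsoByWindow): within each key, regroup values by window.
    out = {}
    for key, windowed_values in grouped.items():
        per_window = defaultdict(list)
        for window, value in windowed_values:
            per_window[window].append(value)
        out[key] = dict(per_window)
    return out

pairs = [(("a", 0), 1), (("a", 0), 2), (("a", 1), 3), (("b", 0), 4)]
result = group_also_by_window(group_by_key_only(pairs))
print(result)  # {'a': {0: [1, 2], 1: [3]}, 'b': {0: [4]}}
```

Splitting the shuffle from the windowed regrouping lets the runner place the two stages independently, which is why the override exists.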
Step 3: Pipeline Graph Construction
The pipeline is traversed by a DirectGraphVisitor to build a DirectGraph, which is an internal representation of the execution DAG. The visitor collects all transforms and their input/output PCollections, identifies root transforms (those without inputs, such as Read and Impulse), and maps each PCollection to its producing transform.
Key considerations:
- The DirectGraph is an immutable snapshot used throughout execution
- Root transforms are the starting points for the execution engine
- KeyedPValueTrackingVisitor identifies which PCollections carry keyed data
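The visitor's two core outputs, root transforms and the PCollection-to-producer map, can be sketched as a single traversal (the tuple representation below is illustrative, not the DirectGraph API):

```python
def build_graph(transforms):
    """transforms: list of (name, input_pcollections, output_pcollections)."""
    producers = {}  # PCollection name -> producing transform
    roots = []      # transforms with no inputs (e.g. Read, Impulse)
    for name, inputs, outputs in transforms:
        for out in outputs:
            producers[out] = name
        if not inputs:
            roots.append(name)
    return roots, producers

transforms = [
    ("Read", [], ["pc1"]),
    ("ParDo", ["pc1"], ["pc2"]),
    ("GroupByKey", ["pc2"], ["pc3"]),
]
roots, producers = build_graph(transforms)
print(roots)      # ['Read']
print(producers)  # {'pc1': 'Read', 'pc2': 'ParDo', 'pc3': 'GroupByKey'}
```

With this mapping in hand, the executor can start at the roots and look up downstream consumers as each PCollection is produced.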
Step 4: Model Enforcement Configuration
The runner configures enforcement checks that validate Beam model invariants at runtime. Immutability enforcement ensures that elements are not mutated after being output. Encodability enforcement validates that all elements can be round-tripped through their coders. These checks can be enabled or disabled via DirectOptions.
Key considerations:
- ImmutabilityEnforcementFactory checks that input elements are not modified during processing
- ImmutabilityCheckingBundleFactory verifies output elements are not mutated after emission
- CloningBundleFactory can be used to defensively copy elements
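The immutability check can be sketched as follows (a simplified analogue: `pickle` stands in for the element's coder, and `process_with_enforcement` is a hypothetical helper, not a Beam class). The idea is to snapshot each input's encoded form before handing it to user code and verify it still encodes identically afterwards:

```python
import pickle

def process_with_enforcement(elements, do_fn):
    outputs = []
    for elem in elements:
        before = pickle.dumps(elem)  # stand-in for coder-encoded bytes
        outputs.extend(do_fn(elem))
        if pickle.dumps(elem) != before:
            raise RuntimeError(f"DoFn mutated its input element: {elem!r}")
    return outputs

# A well-behaved DoFn copies instead of mutating its input:
good = lambda d: [{**d, "seen": True}]
print(process_with_enforcement([{"k": 1}], good))  # [{'k': 1, 'seen': True}]

# A buggy DoFn that mutates its input in place is caught:
def bad(d):
    d["seen"] = True
    return [d]

try:
    process_with_enforcement([{"k": 1}], bad)
except RuntimeError as e:
    print(e)
```

This is exactly the class of bug that passes silently on a distributed runner (where elements are serialized between workers anyway) but corrupts results locally, which is why the Direct Runner enforces it by default.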
Step 5: Parallel Execution
The ExecutorServiceParallelExecutor drives the pipeline to completion using a thread pool. It creates TransformExecutor tasks for each pending bundle of work, maintains per-key serialization for stateful transforms, and uses a QuiescenceDriver to schedule new work as watermarks advance and timers fire. Execution continues until all transforms reach quiescence (no pending work remains).
Key considerations:
- Stateless transforms execute in parallel across all available threads
- Stateful transforms are serialized per StepAndKey to ensure consistency
- The WatermarkManager tracks per-transform watermarks to determine when windows close
- The EvaluationContext coordinates bundles, state, side inputs, and metrics
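The per-key serialization rule can be sketched with a thread pool and one lock per (step, key) pair, mirroring per-StepAndKey serialization (the lock-table approach below is an illustrative simplification of what the executor guarantees):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

state = {}          # per-key state cell, as a stateful transform would hold
locks = {}          # one lock per (step, key): work for the same pair is serialized
locks_guard = threading.Lock()

def lock_for(step_and_key):
    with locks_guard:
        return locks.setdefault(step_and_key, threading.Lock())

def stateful_add(step, key, value):
    # Serialized per (step, key); different keys still run in parallel.
    with lock_for((step, key)):
        state[key] = state.get(key, 0) + value

with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(100):
        pool.submit(stateful_add, "Sum", "a" if i % 2 else "b", 1)

print(state["a"], state["b"])  # 50 50
```

Without the per-key lock, the read-modify-write on `state[key]` would race and lose increments; with it, stateless parallelism across keys is preserved while state for any single key stays consistent.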
Step 6: Result Collection
Once all transforms have reached quiescence, the Direct Runner returns a DirectPipelineResult. This result object provides access to pipeline metrics (counters, distributions, gauges) and the final pipeline state (DONE, FAILED, or CANCELLED), and supports blocking until completion via waitUntilFinish. If any transform failed, the exception is propagated to the caller.
Key considerations:
- DirectMetrics aggregates metrics from all completed bundles
- The result can be configured to block until completion or return immediately
- Failed transforms produce UserCodeException with the original stack trace
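The result-collection contract can be sketched as a small class (illustrative names, not the DirectPipelineResult API): per-bundle counters are merged into pipeline-level metrics, and waiting on the result either returns the terminal state or re-raises the propagated failure.

```python
from collections import Counter

class PipelineResult:
    def __init__(self):
        self.metrics = Counter()   # aggregated across all completed bundles
        self.state = "RUNNING"
        self.error = None

    def commit_bundle(self, counters):
        self.metrics.update(counters)  # merge this bundle's counters

    def finish(self, error=None):
        self.state = "FAILED" if error else "DONE"
        self.error = error

    def wait_until_finish(self):
        if self.error:
            raise self.error           # propagate the user-code failure
        return self.state

result = PipelineResult()
result.commit_bundle({"elements_read": 3})
result.commit_bundle({"elements_read": 2})
result.finish()
print(result.wait_until_finish())        # DONE
print(result.metrics["elements_read"])   # 5
```

Aggregating at commit time, rather than at query time, is what lets the caller read consistent totals the moment the pipeline reaches a terminal state.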