Principle:Apache Beam Result Collection

Field	Value
Principle Name	Result Collection
Overview	Mechanism for collecting pipeline execution results including terminal state, metrics, and error information after execution completes.
Domains	Data_Processing, Monitoring
Related Implementation	Implementation:Apache_Beam_DirectPipelineResult
last_updated	2026-02-09 04:00 GMT

Description

After a pipeline finishes executing -- whether it succeeds, fails, or is cancelled -- the runner must provide a result object that captures the outcome of execution. This result bridges the gap between the asynchronous execution engine (which processes bundles in parallel across a thread pool) and the synchronous user API (where the caller typically wants to block until completion and then inspect results).

The result collection mechanism provides three categories of information:

1. Terminal State

The result reports one of the following states. DONE, FAILED, and CANCELLED are terminal; RUNNING is not:

State	Description
`DONE`	The pipeline completed successfully. All input was consumed, all transforms completed, and all watermarks advanced to positive infinity.
`FAILED`	The pipeline terminated due to an error. User code threw an exception, or an internal error occurred during execution.
`CANCELLED`	The pipeline was cancelled by the user via `cancel()`.
`RUNNING`	The pipeline is still executing (non-terminal state). This is the initial state after `run()` returns if `blockOnRun` is `false`.

The state is accessible via PipelineResult.getState() and may be polled during execution. For blocking execution (the default for the Direct Runner), run() calls waitUntilFinish() internally, so by the time the user receives the result, it is already in a terminal state.
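The one-way nature of these states can be illustrated with a small, self-contained model. This is a sketch only: `ModelState` and `ModelStateHolder` are hypothetical names that mirror the four states in the table above, and the code is not Beam's own `PipelineResult.State` enum or the Direct Runner's implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Simplified model of the four states listed above; not Beam's own enum. */
enum ModelState {
  RUNNING(false),
  DONE(true),
  FAILED(true),
  CANCELLED(true);

  private final boolean terminal;

  ModelState(boolean terminal) {
    this.terminal = terminal;
  }

  boolean isTerminal() {
    return terminal;
  }
}

/** Holds the current state and permits a single RUNNING -> terminal transition. */
class ModelStateHolder {
  private final AtomicReference<ModelState> state =
      new AtomicReference<>(ModelState.RUNNING);

  /** Safe to poll from any thread while execution is in progress. */
  ModelState getState() {
    return state.get();
  }

  /** Returns true only for the first successful move out of RUNNING. */
  boolean finish(ModelState target) {
    if (!target.isTerminal()) {
      throw new IllegalArgumentException("target must be terminal: " + target);
    }
    // compareAndSet succeeds only while the state is still RUNNING, so a
    // later DONE cannot overwrite an earlier CANCELLED (or vice versa).
    return state.compareAndSet(ModelState.RUNNING, target);
  }
}
```

In this model, a caller that polls `getState()` after `finish(...)` has succeeded once will always see the same terminal value, which is the stability property the state table relies on.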

2. Aggregated Metrics

The result provides access to pipeline metrics through PipelineResult.metrics(), which returns a MetricResults object that can be queried for the following metric types:

Counters -- Integer (long) values that can be incremented or decremented (e.g., number of records processed, number of errors).
Distributions -- Statistical distributions of values (sum, count, min, max, mean).
Gauges -- Point-in-time values representing the latest state of a measurement.
String sets -- Sets of string values aggregated across all bundles.
Bounded tries -- Hierarchical string aggregation structures.

Metrics are collected at two levels:

Attempted metrics -- Metrics from all bundles that were processed, including those that may have failed and been retried.
Committed metrics -- Metrics from only the bundles that were successfully committed. This is the authoritative value for correctness.

Metrics are queryable via MetricsFilter, allowing users to select metrics by name, namespace, or step name.
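The difference between attempted and committed values is easiest to see with a retried bundle. The following is a minimal sketch under stated assumptions: `ModelCounterStore` and `recordBundle` are hypothetical names invented for illustration, only counters are modeled, and the real runner's bookkeeping is more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of attempted vs. committed counters; all names are hypothetical. */
class ModelCounterStore {
  private final Map<String, Long> attempted = new HashMap<>();
  private final Map<String, Long> committed = new HashMap<>();

  /**
   * Records the counter deltas produced by one bundle attempt.
   * Every attempt feeds the attempted totals; only a successful
   * attempt also feeds the committed totals.
   */
  void recordBundle(Map<String, Long> deltas, boolean succeeded) {
    for (Map.Entry<String, Long> entry : deltas.entrySet()) {
      attempted.merge(entry.getKey(), entry.getValue(), Long::sum);
      if (succeeded) {
        committed.merge(entry.getKey(), entry.getValue(), Long::sum);
      }
    }
  }

  long attempted(String name) {
    return attempted.getOrDefault(name, 0L);
  }

  long committed(String name) {
    return committed.getOrDefault(name, 0L);
  }
}
```

If a bundle that counts 10 records fails once and then succeeds on retry, this model reports an attempted value of 20 and a committed value of 10, which is why assertions about correctness are normally made against the committed value.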

3. Completion Waiting and Cancellation

The result object provides mechanisms for synchronizing with the asynchronous execution:

waitUntilFinish() -- Blocks the calling thread until the pipeline reaches a terminal state. Returns the terminal state.
waitUntilFinish(Duration duration) -- Blocks for at most the specified duration. Returns the terminal state if reached, or null if the timeout elapsed.
cancel() -- Requests cancellation of the running pipeline. Returns the resulting state.

If the pipeline fails with a user code exception, waitUntilFinish() rethrows the exception wrapped in a PipelineExecutionException. The stack trace is truncated to start at the waitUntilFinish() call site, making it easier to locate the error in user code.
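The waiting and failure behavior described above can be sketched with a `CompletableFuture`. This is an illustrative model, not the Direct Runner's code: `ModelResult`, `ModelOutcome`, and `ModelExecutionException` are hypothetical names, `java.time.Duration` stands in for the Joda-Time `Duration` that Beam's API takes, and the stack-trace truncation is not modeled.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Terminal outcomes used by this model. */
enum ModelOutcome { DONE, FAILED, CANCELLED }

/** Hypothetical stand-in for the exception surfaced to the caller on failure. */
class ModelExecutionException extends RuntimeException {
  ModelExecutionException(Throwable cause) {
    super(cause);
  }
}

/** Simplified result handle backed by a CompletableFuture. */
class ModelResult {
  private final CompletableFuture<ModelOutcome> outcome = new CompletableFuture<>();

  // The two methods below are called by the (simulated) execution engine.
  void complete(ModelOutcome terminalState) {
    outcome.complete(terminalState);
  }

  void fail(Throwable userError) {
    outcome.completeExceptionally(userError);
  }

  /** Blocks for at most the given duration; returns null if the timeout elapses first. */
  ModelOutcome waitUntilFinish(Duration timeout) {
    try {
      return outcome.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      return null; // still running when the timeout elapsed
    } catch (ExecutionException e) {
      // Surface the original user error as the cause of the wrapper.
      throw new ModelExecutionException(e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new ModelExecutionException(e);
    }
  }
}
```

The caller in this sketch sees three distinct outcomes: a terminal state, `null` on timeout, or a wrapper exception whose `getCause()` is the original user error.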

Usage

Result collection is used in the following scenarios:

Blocking execution -- The most common pattern: call pipeline.run().waitUntilFinish() to execute the pipeline and block until completion. The Direct Runner does this automatically when blockOnRun is true (the default).
Asynchronous execution -- Set blockOnRun to false, call run(), and later call waitUntilFinish() on the result. This allows the calling thread to perform other work while the pipeline executes.
Metrics collection -- After execution completes, query the MetricResults to extract counters, distributions, and gauges for monitoring, logging, or testing assertions.
Error handling -- Catch PipelineExecutionException from waitUntilFinish() to handle pipeline failures gracefully.
Pipeline cancellation -- Call cancel() on the result to stop a running pipeline (e.g., in response to a user signal or timeout).
Test assertions -- In unit tests, retrieve metrics to assert that expected processing occurred (e.g., "counter X equals 100" or "no errors were recorded").
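The asynchronous-execution and cancellation scenarios above often combine into a "wait, then cancel on timeout" pattern. The sketch below shows that shape against a hypothetical `ModelHandle` interface that mirrors only the two methods needed; it is an assumption-laden illustration, not Beam's `PipelineResult` type, and it uses `java.time.Duration` and plain strings for states to stay self-contained.

```java
import java.time.Duration;

/** Hypothetical minimal view of a result handle; not Beam's PipelineResult. */
interface ModelHandle {
  /** Returns the terminal state name, or null if the timeout elapsed first. */
  String waitUntilFinish(Duration timeout);

  /** Requests cancellation and returns the resulting state name. */
  String cancel();
}

class AwaitOrCancel {
  /** Waits up to the timeout; if the pipeline is still running, cancels it. */
  static String run(ModelHandle handle, Duration timeout) {
    String state = handle.waitUntilFinish(timeout);
    if (state != null) {
      return state; // finished on its own within the timeout
    }
    return handle.cancel(); // timed out: stop the pipeline
  }
}
```

With Beam itself, the same shape would be written against PipelineResult, where waitUntilFinish takes a Joda-Time Duration and cancel() is declared to throw IOException, so the caller needs to handle that checked exception.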

Theoretical Basis

The result collection mechanism follows the Future pattern from concurrent programming -- a handle to a potentially incomplete computation that provides access to the outcome once available.

Key theoretical aspects:

Asynchronous result handle -- The PipelineResult object is returned immediately by run(), representing the in-progress computation. The caller can choose to block (waitUntilFinish()) or poll (getState()).
Terminal state machine -- State transitions are one-way: RUNNING can move to exactly one of DONE, FAILED, or CANCELLED. A terminal state (one for which isTerminal() returns true) is permanent, so state queries after completion are stable.
Metric aggregation model -- The attempted/committed metric dichotomy loosely resembles the tentative-versus-committed distinction in transactional protocols such as two-phase commit. Attempted metrics are speculative (they may include work from failed bundles), while committed metrics represent the durable outcome.
Exception propagation -- User code exceptions are carried internally in the UserCodeException wrapper and surface to the caller as a PipelineExecutionException. The original cause is preserved, and the stack trace is rooted at the user's call to waitUntilFinish().

Related Pages

Implementation:Apache_Beam_DirectPipelineResult -- Concrete tool for collecting Direct Runner pipeline execution results and metrics.

Sources

Doc -- Apache Beam Pipeline Execution -- Official documentation on running pipelines and collecting results.
