Principle:Apache Beam Result Collection
| Field | Value |
|---|---|
| Principle Name | Result Collection |
| Overview | Mechanism for collecting pipeline execution results including terminal state, metrics, and error information after execution completes. |
| Domains | Data_Processing, Monitoring |
| Related Implementation | Implementation:Apache_Beam_DirectPipelineResult |
| last_updated | 2026-02-09 04:00 GMT |
Description
After a pipeline finishes executing -- either successfully, with failure, or by cancellation -- the runner must provide a result object that captures the outcome of execution. This result bridges the gap between the asynchronous execution engine (which processes bundles in parallel across a thread pool) and the synchronous user API (where the caller typically wants to block until completion and then inspect results).
The result collection mechanism provides three categories of information:
1. Terminal State
The pipeline's terminal state is one of:
| State | Description |
|---|---|
DONE |
The pipeline completed successfully. All input was consumed, all transforms completed, and all watermarks advanced to positive infinity. |
FAILED |
The pipeline terminated due to an error. User code threw an exception, or an internal error occurred during execution. |
CANCELLED |
The pipeline was cancelled by the user via cancel().
|
RUNNING |
The pipeline is still executing (non-terminal state). This is the initial state after run() returns if blockOnRun is false.
|
The state is accessible via PipelineResult.getState() and may be polled during execution. For blocking execution (the default for the Direct Runner), run() calls waitUntilFinish() internally, so by the time the user receives the result, it is already in a terminal state.
2. Aggregated Metrics
The result provides access to pipeline metrics through PipelineResult.metrics(), returning a MetricResults object that supports queried access to:
- Counters -- Monotonically increasing or decreasing integer values (e.g., number of records processed, number of errors).
- Distributions -- Statistical distributions of values (sum, count, min, max, mean).
- Gauges -- Point-in-time values representing the latest state of a measurement.
- String sets -- Sets of string values aggregated across all bundles.
- Bounded tries -- Hierarchical string aggregation structures.
Metrics are collected at two levels:
- Attempted metrics -- Metrics from all bundles that were processed, including those that may have failed and been retried.
- Committed metrics -- Metrics from only the bundles that were successfully committed. This is the authoritative value for correctness.
Metrics are queryable via MetricsFilter, allowing users to select metrics by name, namespace, or step name.
3. Completion Waiting and Cancellation
The result object provides mechanisms for synchronizing with the asynchronous execution:
waitUntilFinish()-- Blocks the calling thread until the pipeline reaches a terminal state. Returns the terminal state.waitUntilFinish(Duration duration)-- Blocks for at most the specified duration. Returns the terminal state if reached, ornullif the timeout elapsed.cancel()-- Requests cancellation of the running pipeline. Returns the resulting state.
If the pipeline fails with a user code exception, waitUntilFinish() rethrows the exception wrapped in a PipelineExecutionException. The stack trace is truncated to start at the waitUntilFinish() call site, making it easier to locate the error in user code.
Usage
Result collection is used in the following scenarios:
- Blocking execution -- The most common pattern: call
pipeline.run().waitUntilFinish()to execute the pipeline and block until completion. The Direct Runner does this automatically whenblockOnRunistrue(the default). - Asynchronous execution -- Set
blockOnRuntofalse, callrun(), and later callwaitUntilFinish()on the result. This allows the calling thread to perform other work while the pipeline executes. - Metrics collection -- After execution completes, query the
MetricResultsto extract counters, distributions, and gauges for monitoring, logging, or testing assertions. - Error handling -- Catch
PipelineExecutionExceptionfromwaitUntilFinish()to handle pipeline failures gracefully. - Pipeline cancellation -- Call
cancel()on the result to stop a running pipeline (e.g., in response to a user signal or timeout). - Test assertions -- In unit tests, retrieve metrics to assert that expected processing occurred (e.g., "counter X equals 100" or "no errors were recorded").
Theoretical Basis
The result collection mechanism follows the Future pattern from concurrent programming -- a handle to a potentially incomplete computation that provides access to the outcome once available.
Key theoretical aspects:
- Asynchronous result handle -- The
PipelineResultobject is returned immediately byrun(), representing the in-progress computation. The caller can choose to block (waitUntilFinish()) or poll (getState()). - Terminal state machine -- The state transitions follow a strict partial order:
RUNNINGcan transition to exactly one ofDONE,FAILED, orCANCELLED. Once a terminal state is reached, it is permanent (theisTerminal()check). This ensures that state queries after completion are stable. - Metric aggregation model -- The attempted/committed metric dichotomy mirrors the two-phase commit protocol used in distributed systems. Attempted metrics are speculative (they may include work from failed bundles), while committed metrics represent the durable outcome.
- Exception propagation -- User code exceptions are propagated through the
UserCodeExceptionwrapper, which preserves the original cause while providing a clean stack trace rooted at the user's call towaitUntilFinish().
Related Pages
- Implementation:Apache_Beam_DirectPipelineResult -- Concrete tool for collecting Direct Runner pipeline execution results and metrics.
Sources
- Doc -- Apache Beam Pipeline Execution -- Official documentation on running pipelines and collecting results.