
Workflow:Apache Beam Portable Pipeline Submission

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Pipeline_Execution, Portability
Last Updated: 2026-02-09 04:30 GMT

Overview

End-to-end process for submitting an Apache Beam pipeline to a remote runner backend using the Portable Framework, covering pipeline serialization, job service communication, artifact staging, and execution monitoring.

Description

This workflow describes how the Beam Portability Framework enables pipelines written in any SDK to execute on any compatible runner. The PortableRunner translates the SDK pipeline into a Runner API protocol buffer, connects to a remote Job Service via gRPC, stages required artifacts (JARs, Python packages), and monitors execution through state polling. The Job Service (InMemoryJobService) manages the lifecycle of submitted jobs and delegates execution to a runner-specific JobInvoker. This architecture decouples the SDK from the runner, enabling cross-language pipelines and runner interoperability.

Usage

Execute this workflow when you need to run a Beam pipeline on a distributed runner (Flink, Spark, etc.) using the portable framework. This is the standard approach for cross-language pipelines, multi-SDK environments, or when using a runner that supports the Beam Job API. It is also the foundation for packaging pipelines as self-contained executable JARs.
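A portable submission is driven by pipeline options. As a minimal sketch, the flags below show what such a submission typically carries (--runner, --job_endpoint, and --environment_type follow the Python SDK's portability options; the endpoint value is an example, and in a real program these strings would be parsed into PipelineOptions):

```python
# Illustrative pipeline options for a portable submission; the endpoint
# value is an example, not a required default.
portable_args = [
    "--runner=PortableRunner",        # use the portability framework
    "--job_endpoint=localhost:8099",  # gRPC address of the Job Service
    "--environment_type=DOCKER",      # how SDK worker code runs (DOCKER, PROCESS, EXTERNAL)
]
```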

Execution Steps

Step 1: Pipeline Translation

The PortableRunner translates the user's SDK-level Pipeline object into a RunnerApi.Pipeline protocol buffer. This proto-based representation is language-independent and captures the complete pipeline graph including transforms, PCollections, coders, windowing strategies, and environment specifications. The runner detects and configures the execution environment type (Docker, process, or external).

Key considerations:

  • The RunnerApi.Pipeline proto is the lingua franca between SDKs and runners
  • Environment configuration determines how user code is packaged and executed
  • Classpath resources are detected automatically for staging
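The shape of the translated pipeline can be sketched with plain dataclasses. The field names below mirror the RunnerApi.Pipeline proto (components, root_transform_ids, and the component maps), but the `translate` helper and the spec dictionaries are illustrative assumptions, not Beam's actual translation code:

```python
from dataclasses import dataclass, field

# Stdlib-only model of the RunnerApi.Pipeline proto's shape; field names
# mirror the proto, but this is a sketch, not the Beam API.
@dataclass
class Components:
    transforms: dict = field(default_factory=dict)            # id -> PTransform spec
    pcollections: dict = field(default_factory=dict)          # id -> PCollection spec
    coders: dict = field(default_factory=dict)                # id -> Coder spec
    windowing_strategies: dict = field(default_factory=dict)  # id -> windowing spec
    environments: dict = field(default_factory=dict)          # id -> Environment spec

@dataclass
class Pipeline:
    components: Components = field(default_factory=Components)
    root_transform_ids: list = field(default_factory=list)

def translate(transform_name: str, env_type: str) -> Pipeline:
    """Toy translation: register one environment and one root transform."""
    p = Pipeline()
    p.components.environments["env0"] = {"urn": f"beam:env:{env_type.lower()}:v1"}
    p.components.transforms["t0"] = {"unique_name": transform_name,
                                     "environment_id": "env0"}
    p.root_transform_ids.append("t0")
    return p
```

Because every transform carries an `environment_id`, the runner can later decide per-transform how user code is packaged and executed.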

Step 2: Job Service Connection

The PortableRunner opens a gRPC channel to the Job Service endpoint specified in pipeline options. The Job Service may be a standalone process (started via JobServerDriver) or an embedded service. The client establishes this channel once and reuses it for the subsequent prepare/run RPCs.

Key considerations:

  • JobServerDriver provides the bootstrap logic for starting job service servers
  • The endpoint is configured via PortableRunner pipeline options
  • The Job Service supports prepare, run, cancel, getState, and getMessages RPCs
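The bootstrap step that JobServerDriver performs can be sketched as: bind the service to a host and port, then publish the endpoint string that clients will pass as the job endpoint option. The port-0 auto-assignment below is an assumption for the sketch, not Beam's actual driver logic:

```python
import socket

def start_job_service(host: str = "localhost", port: int = 0) -> str:
    """Pick a bindable port and return the endpoint clients should dial."""
    sock = socket.socket()
    sock.bind((host, port))            # port 0 lets the OS choose a free port
    bound_port = sock.getsockname()[1]
    sock.close()                       # a real server would keep this open and serve gRPC
    return f"{host}:{bound_port}"      # value passed via the job endpoint option
```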

Step 3: Job Preparation

The client sends a Prepare RPC to the Job Service with the serialized pipeline proto. The InMemoryJobService creates a JobPreparation record, assigns a unique job ID, and sets up an ArtifactStagingService endpoint for the client to upload dependencies. The service validates the pipeline and prepares it for execution.

Key considerations:

  • Each job receives a unique preparation ID and staging token
  • The pipeline proto includes all necessary metadata for execution
  • The InMemoryJobService stores job state in ConcurrentHashMaps
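The preparation bookkeeping described above can be modeled with a plain dict standing in for the service's ConcurrentHashMaps. The class and method names mirror the document's description, but the response shape and endpoint value are illustrative assumptions, not Beam's actual API:

```python
import uuid

class InMemoryJobService:
    """Sketch of the Prepare handling: mint ids, record the pipeline."""

    def __init__(self, staging_endpoint: str):
        self._staging_endpoint = staging_endpoint
        self._preparations = {}   # preparation id -> (pipeline proto, staging token)

    def prepare(self, pipeline_proto) -> dict:
        preparation_id = f"job-{uuid.uuid4()}"
        staging_token = str(uuid.uuid4())
        self._preparations[preparation_id] = (pipeline_proto, staging_token)
        return {
            "preparation_id": preparation_id,
            "staging_token": staging_token,
            "artifact_staging_endpoint": self._staging_endpoint,
        }
```

The client holds on to the preparation ID and staging token: the token authorizes artifact uploads in the next step, and the ID is what the later Run RPC refers back to.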

Step 4: Artifact Staging

The client uploads all required artifacts (JAR files, Python packages, other dependencies) to the ArtifactStagingService endpoint provided during preparation. For Java pipelines, this typically includes the pipeline JAR and all transitive dependencies. The PortablePipelineJarCreator can alternatively bundle everything into a single executable JAR.

Key considerations:

  • Artifacts are uploaded via the ArtifactStagingService gRPC endpoint
  • PortablePipelineJarCreator bundles pipeline proto and dependencies into a single JAR
  • PortablePipelineJarUtils provides utilities for reading/writing pipeline JARs
  • The staging directory is configurable via job server options
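The single-jar packaging idea behind PortablePipelineJarCreator can be sketched with the stdlib zipfile module (a JAR is a zip archive): write the serialized pipeline plus every staged artifact into one archive. The entry paths used here are assumptions for illustration; the real layout is defined by PortablePipelineJarUtils:

```python
import io
import json
import zipfile

def create_pipeline_jar(pipeline_proto: dict, artifacts: dict) -> bytes:
    """Bundle a serialized pipeline and its artifacts into one zip/jar."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        # The pipeline definition travels inside the jar itself.
        jar.writestr("BEAM-PIPELINE/pipeline.json", json.dumps(pipeline_proto))
        # Each staged dependency becomes an archive entry.
        for name, payload in artifacts.items():
            jar.writestr(f"BEAM-ARTIFACT-STAGING/{name}", payload)
    return buf.getvalue()
```

Because the jar carries both the pipeline proto and its dependencies, it can later be submitted without a separate artifact-staging round trip.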

Step 5: Job Execution

The client sends a Run RPC with the preparation ID. The InMemoryJobService creates a JobInvocation via the runner-specific JobInvoker, which translates the portable pipeline to the target runner's execution model and submits it. The JobInvocation manages the running job's lifecycle, tracking state transitions (STOPPED, STARTING, RUNNING, and the terminal states DONE, FAILED, and CANCELLED).

Key considerations:

  • JobInvoker is the runner-specific bridge (e.g., FlinkJobInvoker, SparkJobInvoker)
  • JobInvocation wraps the running job and provides state/message streaming
  • The job runs asynchronously; the client receives a job ID for monitoring
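The lifecycle tracking can be modeled as a small state machine. The state names below follow Beam's JobState values, but the transition rule (any move allowed until a terminal state is reached) is a simplification for the sketch:

```python
import enum

class JobState(enum.Enum):
    STOPPED = "STOPPED"      # initial
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    DONE = "DONE"            # terminal
    FAILED = "FAILED"        # terminal
    CANCELLED = "CANCELLED"  # terminal

TERMINAL = {JobState.DONE, JobState.FAILED, JobState.CANCELLED}

class JobInvocation:
    """Sketch of a running job's lifecycle tracking."""

    def __init__(self, job_id: str):
        self.job_id = job_id
        self.state = JobState.STOPPED
        self.history = [self.state]   # the stream clients observe via GetMessages

    def transition(self, new_state: JobState) -> None:
        if self.state in TERMINAL:
            raise ValueError(f"{self.job_id} is already terminal: {self.state}")
        self.state = new_state
        self.history.append(new_state)
```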

Step 6: Execution Monitoring

The client polls the Job Service for state updates via GetState RPCs and receives log messages via GetMessages streaming RPCs. The JobServicePipelineResult wraps these interactions, providing a familiar PipelineResult interface with waitUntilFinish, getState, and metrics access. The client can also cancel the job via a Cancel RPC.

Key considerations:

  • JobServicePipelineResult polls for state changes at configurable intervals
  • PortableMetrics converts portable metric protos to SDK MetricResults format
  • The client can block until completion or run asynchronously
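The polling behind a waitUntilFinish-style call reduces to a simple loop: issue GetState until a terminal state appears or a deadline passes. In this sketch `get_state` stands in for the GetState RPC (any callable returning the current state string), and the interval default is an assumption for illustration:

```python
import time

TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED"}

def wait_until_finish(get_state, poll_interval_s: float = 1.0,
                      timeout_s: float = None) -> str:
    """Poll get_state() until a terminal state or the timeout; return last state."""
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if deadline is not None and time.monotonic() >= deadline:
            return state          # timed out; job is still non-terminal
        time.sleep(poll_interval_s)
```

Running asynchronously instead simply means skipping this loop and checking the state on demand.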

Execution Diagram
