Workflow: Apache Beam Portable Pipeline Submission
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Pipeline_Execution, Portability |
| Last Updated | 2026-02-09 04:30 GMT |
Overview
End-to-end process for submitting an Apache Beam pipeline to a remote runner backend using the Portability Framework, covering pipeline serialization, job service communication, artifact staging, and execution monitoring.
Description
This workflow describes how the Beam Portability Framework enables pipelines written in any SDK to execute on any compatible runner. The PortableRunner translates the SDK pipeline into a Runner API protocol buffer, connects to a remote Job Service via gRPC, stages required artifacts (JARs, Python packages), and monitors execution through state polling. The Job Service (e.g., the InMemoryJobService reference implementation) manages the lifecycle of submitted jobs and delegates execution to a runner-specific JobInvoker. This architecture decouples the SDK from the runner, enabling cross-language pipelines and runner interoperability.
Usage
Execute this workflow when you need to run a Beam pipeline on a distributed runner (Flink, Spark, etc.) using the portable framework. This is the standard approach for cross-language pipelines, multi-SDK environments, or when using a runner that supports the Beam Job API. It is also the foundation for packaging pipelines as self-contained executable JARs.
Execution Steps
Step 1: Pipeline Translation
The PortableRunner translates the user's SDK-level Pipeline object into a RunnerApi.Pipeline protocol buffer. This proto-based representation is language-independent and captures the complete pipeline graph including transforms, PCollections, coders, windowing strategies, and environment specifications. The runner detects and configures the execution environment type (Docker, process, or external).
Key considerations:
- The RunnerApi.Pipeline proto is the lingua franca between SDKs and runners
- Environment configuration determines how user code is packaged and executed
- Classpath resources are detected automatically for staging
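The environment-selection logic described above can be sketched as a small decision over a pipeline option value. The class and method names below are illustrative, not Beam's actual translation code (which lives in its pipeline-construction classes); only the three environment kinds and the Docker default reflect the source text.

```java
// Hypothetical sketch: how a portable runner might map an environment-type
// pipeline option to one of the three execution environment kinds.
public class EnvironmentSelector {
    public enum Kind { DOCKER, PROCESS, EXTERNAL }

    // Docker is the default when no environment type is configured.
    public static Kind select(String optionValue) {
        if (optionValue == null || optionValue.isEmpty()) {
            return Kind.DOCKER;
        }
        switch (optionValue.toUpperCase()) {
            case "DOCKER":   return Kind.DOCKER;   // user code runs in a container image
            case "PROCESS":  return Kind.PROCESS;  // user code runs as a local subprocess
            case "EXTERNAL": return Kind.EXTERNAL; // a pre-started worker pool is used
            default:
                throw new IllegalArgumentException("Unknown environment type: " + optionValue);
        }
    }
}
```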
Step 2: Job Service Connection
The PortableRunner opens a gRPC channel to the Job Service endpoint specified in pipeline options. The Job Service may be a standalone process (started via JobServerDriver) or an embedded service. The client establishes the communication channel used for the subsequent prepare/run RPCs.
Key considerations:
- JobServerDriver provides the bootstrap logic for starting job service servers
- The endpoint is configured via PortableRunner pipeline options
- The Job Service supports prepare, run, cancel, getState, and getMessages RPCs
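Before the channel can be opened, the endpoint option (typically a `host:port` string) has to be split into its parts. A minimal sketch of that parsing step, with hypothetical class and field names (Beam performs this inside its own option handling):

```java
// Hypothetical sketch: parsing a job endpoint option of the form "host:port"
// before opening the gRPC channel to the Job Service.
public class JobEndpoint {
    public final String host;
    public final int port;

    private JobEndpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }

    public static JobEndpoint parse(String endpoint) {
        int idx = endpoint.lastIndexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("Expected host:port, got: " + endpoint);
        }
        return new JobEndpoint(
            endpoint.substring(0, idx),
            Integer.parseInt(endpoint.substring(idx + 1)));
    }
}
```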
Step 3: Job Preparation
The client sends a Prepare RPC to the Job Service with the serialized pipeline proto. The InMemoryJobService creates a JobPreparation record, assigns a unique job ID, and sets up an ArtifactStagingService endpoint for the client to upload dependencies. The service validates the pipeline and prepares it for execution.
Key considerations:
- Each job receives a unique preparation ID and staging token
- The pipeline proto includes all necessary metadata for execution
- The InMemoryJobService stores job state in ConcurrentHashMaps
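The bookkeeping performed by the Prepare RPC can be modeled as a small registry: assign a unique preparation ID and staging token, keep the serialized pipeline, and store the record in a ConcurrentHashMap, roughly mirroring how InMemoryJobService holds job state. All names below are illustrative, not Beam's API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the Prepare step's bookkeeping, in the spirit of
// InMemoryJobService: each prepared job gets a unique id, a staging token,
// and an entry in a concurrent map.
public class PreparationRegistry {
    public static final class Preparation {
        public final String preparationId;
        public final String stagingToken;
        public final byte[] pipelineProto; // serialized RunnerApi.Pipeline

        Preparation(String id, String token, byte[] proto) {
            this.preparationId = id;
            this.stagingToken = token;
            this.pipelineProto = proto;
        }
    }

    private final Map<String, Preparation> preparations = new ConcurrentHashMap<>();

    public Preparation prepare(String jobName, byte[] pipelineProto) {
        String id = jobName + "_" + UUID.randomUUID();
        Preparation p = new Preparation(id, UUID.randomUUID().toString(), pipelineProto);
        preparations.put(id, p);
        return p;
    }

    public Preparation get(String preparationId) {
        return preparations.get(preparationId);
    }
}
```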
Step 4: Artifact Staging
The client uploads all required artifacts (JAR files, Python packages, other dependencies) to the ArtifactStagingService endpoint provided during preparation. For Java pipelines, this typically includes the pipeline JAR and all transitive dependencies. The PortablePipelineJarCreator can alternatively bundle everything into a single executable JAR.
Key considerations:
- Artifacts are uploaded via the ArtifactStagingService gRPC endpoint
- PortablePipelineJarCreator bundles pipeline proto and dependencies into a single JAR
- PortablePipelineJarUtils provides utilities for reading/writing pipeline JARs
- The staging directory is configurable via job server options
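The single-JAR packaging idea can be sketched with the standard `java.util.jar` API: write a manifest with a main class and embed the serialized pipeline as a JAR entry. The entry path and method names below are illustrative; Beam's actual JAR layout is defined by PortablePipelineJarUtils.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical sketch of bundling a serialized pipeline into a single
// executable JAR, in the spirit of PortablePipelineJarCreator.
public class PipelineJarSketch {
    public static byte[] createJar(byte[] pipelineProto, String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, manifest)) {
            // Embed the serialized pipeline so the JAR can re-submit itself on launch.
            // (The entry path here is illustrative, not Beam's actual layout.)
            jar.putNextEntry(new JarEntry("BEAM-PIPELINE/pipeline.json"));
            jar.write(pipelineProto);
            jar.closeEntry();
        }
        return bytes.toByteArray();
    }
}
```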
Step 5: Job Execution
The client sends a Run RPC with the preparation ID. The InMemoryJobService creates a JobInvocation via the runner-specific JobInvoker, which translates the portable pipeline to the target runner's execution model and submits it. The JobInvocation manages the running job's lifecycle, tracking state transitions (such as STOPPED, STARTING, RUNNING, DONE, FAILED, and CANCELLED).
Key considerations:
- JobInvoker is the runner-specific bridge (e.g., FlinkJobInvoker, SparkJobInvoker)
- JobInvocation wraps the running job and provides state/message streaming
- The job runs asynchronously; the client receives a job ID for monitoring
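The asynchronous invocation described above can be modeled as a small state machine: the job starts on a background thread, transitions through STARTING and RUNNING, and ends in DONE or FAILED. This is a simplified sketch of the lifecycle only; Beam's real JobInvocation also streams messages and metrics, and the class below is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a JobInvocation-style wrapper: the job runs
// asynchronously and exposes its state transitions.
public class InvocationSketch {
    public enum State { STOPPED, STARTING, RUNNING, DONE, FAILED, CANCELLED }

    private final AtomicReference<State> state = new AtomicReference<>(State.STOPPED);
    private final CountDownLatch finished = new CountDownLatch(1);

    public void start(Runnable job) {
        state.set(State.STARTING);
        Thread worker = new Thread(() -> {
            state.set(State.RUNNING);
            try {
                job.run();
                state.set(State.DONE);
            } catch (RuntimeException e) {
                state.set(State.FAILED);
            } finally {
                finished.countDown();
            }
        });
        worker.start();
    }

    public State getState() {
        return state.get();
    }

    // Blocks until the job reaches a terminal state, then returns it.
    public State waitUntilFinish() throws InterruptedException {
        finished.await();
        return state.get();
    }
}
```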
Step 6: Execution Monitoring
The client polls the Job Service for state updates via GetState RPCs and receives log messages via GetMessages streaming RPCs. The JobServicePipelineResult wraps these interactions, providing a familiar PipelineResult interface with waitUntilFinish, getState, and metrics access. The client can also cancel the job via a Cancel RPC.
Key considerations:
- JobServicePipelineResult polls for state changes at configurable intervals
- PortableMetrics converts portable metric protos to SDK MetricResults format
- The client can block until completion or run asynchronously
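The polling behavior behind a waitUntilFinish call can be sketched as a loop that queries the current state at a fixed interval until a terminal state appears. In the real client the state supplier would be a GetState RPC to the Job Service; the helper class below is illustrative, not Beam's JobServicePipelineResult API.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of the monitoring loop: poll the job state at a
// configurable interval until a terminal state is observed.
public class StatePoller {
    public enum State { STARTING, RUNNING, DONE, FAILED, CANCELLED }

    private static final Set<State> TERMINAL =
        EnumSet.of(State.DONE, State.FAILED, State.CANCELLED);

    // getState stands in for a GetState RPC to the Job Service.
    public static State waitUntilFinish(Supplier<State> getState, long pollMillis)
            throws InterruptedException {
        while (true) {
            State s = getState.get();
            if (TERMINAL.contains(s)) {
                return s;
            }
            Thread.sleep(pollMillis);
        }
    }
}
```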