
Workflow:Apache Beam Portable Pipeline Submission

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Pipeline_Execution, Portability
Last Updated: 2026-02-09 04:30 GMT

Overview

End-to-end process for submitting an Apache Beam pipeline to a remote runner backend using the Portable Framework, covering pipeline serialization, job service communication, artifact staging, and execution monitoring.

Description

This workflow describes how the Beam Portability Framework enables pipelines written in any SDK to execute on any compatible runner. The PortableRunner translates the SDK pipeline into a Runner API protocol buffer, connects to a remote Job Service via gRPC, stages required artifacts (JARs, Python packages), and monitors execution through state polling. The Job Service (InMemoryJobService) manages the lifecycle of submitted jobs and delegates execution to a runner-specific JobInvoker. This architecture decouples the SDK from the runner, enabling cross-language pipelines and runner interoperability.

Usage

Execute this workflow when you need to run a Beam pipeline on a distributed runner (Flink, Spark, etc.) using the portable framework. This is the standard approach for cross-language pipelines, multi-SDK environments, or when using a runner that supports the Beam Job API. It is also the foundation for packaging pipelines as self-contained executable JARs.
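A portable submission is driven by pipeline options. As a minimal sketch, the flags below show what such a submission typically carries (--runner, --job_endpoint, and --environment_type follow the Python SDK's portability options; the endpoint value is an example, and in a real program these strings would be parsed into PipelineOptions):

```python
# Illustrative pipeline options for a portable submission; the endpoint
# value is an example, not a required default.
portable_args = [
    "--runner=PortableRunner",        # use the portability framework
    "--job_endpoint=localhost:8099",  # gRPC address of the Job Service
    "--environment_type=DOCKER",      # how SDK worker code runs (DOCKER, PROCESS, EXTERNAL)
]
```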

Execution Steps

Step 1: Pipeline Translation

The PortableRunner translates the user's SDK-level Pipeline object into a RunnerApi.Pipeline protocol buffer. This proto-based representation is language-independent and captures the complete pipeline graph including transforms, PCollections, coders, windowing strategies, and environment specifications. The runner detects and configures the execution environment type (Docker, process, or external).

Key considerations:

  • The RunnerApi.Pipeline proto is the lingua franca between SDKs and runners
  • Environment configuration determines how user code is packaged and executed
  • Classpath resources are detected automatically for staging
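The shape of the translated pipeline can be sketched with plain dataclasses. The field names below mirror the RunnerApi.Pipeline proto (components, root_transform_ids, and the component maps), but the `translate` helper and the spec dictionaries are illustrative assumptions, not Beam's actual translation code:

```python
from dataclasses import dataclass, field

# Stdlib-only model of the RunnerApi.Pipeline proto's shape; field names
# mirror the proto, but this is a sketch, not the Beam API.
@dataclass
class Components:
    transforms: dict = field(default_factory=dict)            # id -> PTransform spec
    pcollections: dict = field(default_factory=dict)          # id -> PCollection spec
    coders: dict = field(default_factory=dict)                # id -> Coder spec
    windowing_strategies: dict = field(default_factory=dict)  # id -> windowing spec
    environments: dict = field(default_factory=dict)          # id -> Environment spec

@dataclass
class Pipeline:
    components: Components = field(default_factory=Components)
    root_transform_ids: list = field(default_factory=list)

def translate(transform_name: str, env_type: str) -> Pipeline:
    """Toy translation: register one environment and one root transform."""
    p = Pipeline()
    p.components.environments["env0"] = {"urn": f"beam:env:{env_type.lower()}:v1"}
    p.components.transforms["t0"] = {"unique_name": transform_name,
                                     "environment_id": "env0"}
    p.root_transform_ids.append("t0")
    return p
```

Because every transform carries an `environment_id`, the runner can later decide per-transform how user code is packaged and executed.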

Step 2: Job Service Connection

The PortableRunner opens a gRPC channel to the Job Service endpoint specified in pipeline options. The Job Service may be a standalone process (started via JobServerDriver) or an embedded service. The client establishes this channel once and reuses it for the subsequent prepare/run RPCs.

Key considerations:

  • JobServerDriver provides the bootstrap logic for starting job service servers
  • The endpoint is configured via PortableRunner pipeline options
  • The Job Service supports prepare, run, cancel, getState, and getMessages RPCs
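The bootstrap step that JobServerDriver performs can be sketched as: bind the service to a host and port, then publish the endpoint string that clients will pass as the job endpoint option. The port-0 auto-assignment below is an assumption for the sketch, not Beam's actual driver logic:

```python
import socket

def start_job_service(host: str = "localhost", port: int = 0) -> str:
    """Pick a bindable port and return the endpoint clients should dial."""
    sock = socket.socket()
    sock.bind((host, port))            # port 0 lets the OS choose a free port
    bound_port = sock.getsockname()[1]
    sock.close()                       # a real server would keep this open and serve gRPC
    return f"{host}:{bound_port}"      # value passed via the job endpoint option
```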

Step 3: Job Preparation

The client sends a Prepare RPC to the Job Service with the serialized pipeline proto. The InMemoryJobService creates a JobPreparation record, assigns a unique job ID, and sets up an ArtifactStagingService endpoint for the client to upload dependencies. The service validates the pipeline and prepares it for execution.

Key considerations:

  • Each job receives a unique preparation ID and staging token
  • The pipeline proto includes all necessary metadata for execution
  • The InMemoryJobService stores job state in ConcurrentHashMaps
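The preparation bookkeeping described above can be modeled with a plain dict standing in for the service's ConcurrentHashMaps. The class and method names mirror the document's description, but the response shape and endpoint value are illustrative assumptions, not Beam's actual API:

```python
import uuid

class InMemoryJobService:
    """Sketch of the Prepare handling: mint ids, record the pipeline."""

    def __init__(self, staging_endpoint: str):
        self._staging_endpoint = staging_endpoint
        self._preparations = {}   # preparation id -> (pipeline proto, staging token)

    def prepare(self, pipeline_proto) -> dict:
        preparation_id = f"job-{uuid.uuid4()}"
        staging_token = str(uuid.uuid4())
        self._preparations[preparation_id] = (pipeline_proto, staging_token)
        return {
            "preparation_id": preparation_id,
            "staging_token": staging_token,
            "artifact_staging_endpoint": self._staging_endpoint,
        }
```

The client holds on to the preparation ID and staging token: the token authorizes artifact uploads in the next step, and the ID is what the later Run RPC refers back to.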

Step 4: Artifact Staging

The client uploads all required artifacts (JAR files, Python packages, other dependencies) to the ArtifactStagingService endpoint provided during preparation. For Java pipelines, this typically includes the pipeline JAR and all transitive dependencies. The PortablePipelineJarCreator can alternatively bundle everything into a single executable JAR.

Key considerations:

  • Artifacts are uploaded via the ArtifactStagingService gRPC endpoint
  • PortablePipelineJarCreator bundles pipeline proto and dependencies into a single JAR
  • PortablePipelineJarUtils provides utilities for reading/writing pipeline JARs
  • The staging directory is configurable via job server options
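The single-jar packaging idea behind PortablePipelineJarCreator can be sketched with the stdlib zipfile module (a JAR is a zip archive): write the serialized pipeline plus every staged artifact into one archive. The entry paths used here are assumptions for illustration; the real layout is defined by PortablePipelineJarUtils:

```python
import io
import json
import zipfile

def create_pipeline_jar(pipeline_proto: dict, artifacts: dict) -> bytes:
    """Bundle a serialized pipeline and its artifacts into one zip/jar."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        # The pipeline definition travels inside the jar itself.
        jar.writestr("BEAM-PIPELINE/pipeline.json", json.dumps(pipeline_proto))
        # Each staged dependency becomes an archive entry.
        for name, payload in artifacts.items():
            jar.writestr(f"BEAM-ARTIFACT-STAGING/{name}", payload)
    return buf.getvalue()
```

Because the jar carries both the pipeline proto and its dependencies, it can later be submitted without a separate artifact-staging round trip.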

Step 5: Job Execution

The client sends a Run RPC with the preparation ID. The InMemoryJobService creates a JobInvocation via the runner-specific JobInvoker, which translates the portable pipeline to the target runner's execution model and submits it. The JobInvocation manages the running job's lifecycle, tracking state transitions (STOPPED, STARTING, RUNNING, and the terminal states DONE, FAILED, and CANCELLED).

Key considerations:

  • JobInvoker is the runner-specific bridge (e.g., FlinkJobInvoker, SparkJobInvoker)
  • JobInvocation wraps the running job and provides state/message streaming
  • The job runs asynchronously; the client receives a job ID for monitoring
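The lifecycle tracking can be modeled as a small state machine. The state names below follow Beam's JobState values, but the transition rule (any move allowed until a terminal state is reached) is a simplification for the sketch:

```python
import enum

class JobState(enum.Enum):
    STOPPED = "STOPPED"      # initial
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    DONE = "DONE"            # terminal
    FAILED = "FAILED"        # terminal
    CANCELLED = "CANCELLED"  # terminal

TERMINAL = {JobState.DONE, JobState.FAILED, JobState.CANCELLED}

class JobInvocation:
    """Sketch of a running job's lifecycle tracking."""

    def __init__(self, job_id: str):
        self.job_id = job_id
        self.state = JobState.STOPPED
        self.history = [self.state]   # the stream clients observe via GetMessages

    def transition(self, new_state: JobState) -> None:
        if self.state in TERMINAL:
            raise ValueError(f"{self.job_id} is already terminal: {self.state}")
        self.state = new_state
        self.history.append(new_state)
```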

Step 6: Execution Monitoring

The client polls the Job Service for state updates via GetState RPCs and receives log messages via GetMessages streaming RPCs. The JobServicePipelineResult wraps these interactions, providing a familiar PipelineResult interface with waitUntilFinish, getState, and metrics access. The client can also cancel the job via a Cancel RPC.

Key considerations:

  • JobServicePipelineResult polls for state changes at configurable intervals
  • PortableMetrics converts portable metric protos to SDK MetricResults format
  • The client can block until completion or run asynchronously
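The polling behind a waitUntilFinish-style call reduces to a simple loop: issue GetState until a terminal state appears or a deadline passes. In this sketch `get_state` stands in for the GetState RPC (any callable returning the current state string), and the interval default is an assumption for illustration:

```python
import time

TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED"}

def wait_until_finish(get_state, poll_interval_s: float = 1.0,
                      timeout_s: float = None) -> str:
    """Poll get_state() until a terminal state or the timeout; return last state."""
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if deadline is not None and time.monotonic() >= deadline:
            return state          # timed out; job is still non-terminal
        time.sleep(poll_interval_s)
```

Running asynchronously instead simply means skipping this loop and checking the state on demand.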

Execution Diagram
