
Principle:Apache Beam Artifact Staging

From Leeroopedia


Property                Value
Principle Name          Artifact Staging
Category                Pipeline_Orchestration, Packaging
Related Implementation  Implementation:Apache_Beam_PortablePipelineJarCreator
Source Reference        runners/java-job-service/.../PortablePipelineJarCreator.java
Last Updated            2026-02-09 04:00 GMT

Overview

Artifact Staging is the process of packaging pipeline dependencies (artifacts) and the pipeline definition itself into a deployable format for remote execution. It bridges the gap between the client-side development environment and the runner-side execution environment.

Description

Artifact staging transfers pipeline dependencies from the client to the runner environment. Two primary approaches exist in the Beam framework:

Approach 1: gRPC-Based Staging

The standard approach uploads artifacts to a staging service endpoint using a session token obtained during job preparation. The ArtifactStagingService coordinates the transfer:

// From PortableRunner.run() at L193-206
try (CloseableResource<ManagedChannel> artifactChannel =
    CloseableResource.of(
        channelFactory.forDescriptor(artifactStagingEndpoint), ManagedChannel::shutdown)) {

    ArtifactStagingService.offer(
        new ArtifactRetrievalService(),
        ArtifactStagingServiceGrpc.newStub(artifactChannel.get()),
        stagingSessionToken);
}

In this approach:

  • The client opens a gRPC channel to the artifact staging endpoint returned by the prepare RPC.
  • The ArtifactRetrievalService reads artifacts from the client's local environment.
  • The ArtifactStagingService.offer() method streams each artifact to the staging server.
  • The staging session token authorizes the upload and associates artifacts with the correct preparation.

Approach 2: JAR-Based Staging

The JAR-based approach bundles the pipeline proto, options, and all classpath dependencies into a self-contained executable JAR. This enables offline pipeline packaging for later execution:

// PortablePipelineJarCreator.run() at L89-113
public PortablePipelineResult run(Pipeline pipeline, JobInfo jobInfo) throws Exception {
    // Creates a JAR containing:
    //   BEAM-PIPELINE/pipeline-manifest.json
    //   BEAM-PIPELINE/{jobName}/pipeline.json
    //   BEAM-PIPELINE/{jobName}/pipeline-options.json
    //   BEAM-PIPELINE/{jobName}/artifacts/...
    //   ...Java classes from main JAR...
}

Artifact Types

Artifact Type            Description                                                 Staging Mechanism
SDK worker harness       The SDK runtime that executes user transforms               Environment dependency in the pipeline proto
User code JARs           Application code compiled into JAR files                    Classpath detection via detectClassPathResourcesToStage()
Additional JARs          Extra dependencies from the jar_packages experiment         Experimental flag parsing
Explicitly staged files  Files listed via PortablePipelineOptions.getFilesToStage()  Direct user configuration
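The "Additional JARs" row relies on parsing an experiment string out of the pipeline options. As a rough illustration, the sketch below assumes a `jar_packages=a.jar,b.jar` syntax (the exact format is an assumption, not taken from the source above):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of collecting extra JARs from an experiments list.
// The "jar_packages=a.jar,b.jar" syntax is assumed here for illustration.
public class JarPackages {
    static List<String> extraJars(List<String> experiments) {
        List<String> jars = new ArrayList<>();
        for (String experiment : experiments) {
            if (experiment.startsWith("jar_packages=")) {
                // Everything after '=' is a comma-separated list of JAR paths.
                for (String jar : experiment.substring("jar_packages=".length()).split(",")) {
                    if (!jar.isEmpty()) {
                        jars.add(jar);
                    }
                }
            }
        }
        return jars;
    }
}
```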

File Detection and Collection

Before staging, the runner collects all files that need to be staged:

// From PortableRunner.run() at L130-160
ImmutableList.Builder<String> filesToStageBuilder = ImmutableList.builder();
List<String> stagingFiles = options.as(PortablePipelineOptions.class).getFilesToStage();
if (stagingFiles == null) {
    List<String> classpathResources =
        detectClassPathResourcesToStage(Environments.class.getClassLoader(), options);
    filesToStageBuilder.addAll(classpathResources);
} else {
    filesToStageBuilder.addAll(stagingFiles);
}
// Also adds jar_packages from experimental flags
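The fallback logic in this snippet reduces to a simple rule: explicit user configuration wins; otherwise the runner detects the classpath itself. A self-contained sketch of that rule, with detectClasspath() as a hypothetical stand-in for detectClassPathResourcesToStage():

```java
import java.util.List;

// Sketch of the files-to-stage fallback: explicit user configuration takes
// precedence; otherwise fall back to classpath detection. detectClasspath()
// is a hypothetical stand-in for detectClassPathResourcesToStage().
public class FilesToStage {
    static List<String> detectClasspath() {
        return List.of("/app/classes.jar", "/app/deps.jar"); // placeholder result
    }

    static List<String> resolve(List<String> explicitFiles) {
        return explicitFiles == null ? detectClasspath() : explicitFiles;
    }
}
```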

Usage

Artifact staging happens automatically during portable pipeline submission. The JAR creator path is used when creating offline pipeline packages.

gRPC Staging Workflow

  1. Job preparation returns an artifactStagingEndpoint and stagingSessionToken
  2. Client opens a gRPC channel to the staging endpoint
  3. Client streams each artifact to the staging service via ArtifactStagingService.offer()
  4. Staging service stores artifacts keyed by session token
  5. During job run, the server resolves dependencies from the staged artifacts
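The token-keyed bookkeeping in steps 1, 4, and 5 can be modeled with a plain map. This is a toy model, not the ArtifactStagingService API; all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Toy model of the gRPC staging workflow's server-side bookkeeping:
// preparation mints a token, uploads are stored under it, and the
// runner resolves them at run time.
public class StagingServiceModel {
    final Map<String, List<byte[]>> staged = new HashMap<>();

    // Step 1: job preparation mints a fresh staging session token.
    String prepare() {
        String token = UUID.randomUUID().toString();
        staged.put(token, new ArrayList<>());
        return token;
    }

    // Steps 3-4: each uploaded artifact is keyed by the session token.
    void stage(String token, byte[] artifact) {
        List<byte[]> artifacts = staged.get(token);
        if (artifacts == null) {
            throw new IllegalStateException("unknown staging session: " + token);
        }
        artifacts.add(artifact);
    }

    // Step 5: during the run, dependencies resolve to the staged artifacts.
    List<byte[]> resolve(String token) {
        return staged.getOrDefault(token, List.of());
    }
}
```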

JAR Packaging Workflow

  1. PortablePipelineJarCreator.run() is invoked with the pipeline and job info
  2. A manifest is created with the main class for later execution
  3. Resources from the original JAR are copied to the output JAR
  4. Pipeline options are serialized as JSON into the JAR
  5. Each artifact is written into the JAR's BEAM-PIPELINE/{jobName}/artifacts/ directory
  6. The pipeline proto (with updated artifact references) is serialized as JSON into the JAR
  7. The resulting JAR can be executed independently via java -jar
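Steps 2 through 6 can be sketched with nothing but java.util.jar: a manifest carrying Main-Class, plus entries under BEAM-PIPELINE/. This is an illustrative sketch of the packaging mechanics, not the PortablePipelineJarCreator implementation; the JSON payloads are placeholders:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Sketch of the JAR packaging workflow using only java.util.jar. Entry paths
// follow the documented BEAM-PIPELINE/ layout; the contents are placeholders.
public class PipelineJarSketch {
    static byte[] writeJar(String jobName, String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        // Manifest-Version is mandatory, or no main attributes get written.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, manifest)) {
            jar.putNextEntry(new JarEntry("BEAM-PIPELINE/pipeline-manifest.json"));
            jar.write(("{\"defaultJobName\":\"" + jobName + "\"}").getBytes());
            jar.closeEntry();
            jar.putNextEntry(new JarEntry("BEAM-PIPELINE/" + jobName + "/pipeline.json"));
            jar.write("{}".getBytes()); // placeholder for the serialized pipeline proto
            jar.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Reads back the Main-Class that `java -jar` would launch.
    static String readMainClass(byte[] jarBytes) throws IOException {
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            return jar.getManifest().getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }
}
```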

Theoretical Basis

Artifact Staging is based on dependency injection and packaging patterns fundamental to distributed execution:

  • Dependency Closure -- All code and data needed for execution must be available on the runner. Artifact staging ensures the execution environment contains a complete closure of dependencies, analogous to how container images package all runtime dependencies.
  • Content-Addressable Storage -- In the gRPC staging path, artifacts are keyed by staging session tokens and environment IDs, enabling the server to associate the correct artifacts with each pipeline environment.
  • Self-Contained Packaging -- The JAR-based approach creates a self-contained executable unit, following the "fat JAR" or "uber JAR" pattern common in Java deployments. This ensures the pipeline can be executed without external dependency resolution.
  • Late Binding -- The two-step process (prepare then stage) enables late binding of artifact locations. The pipeline proto references abstract artifact descriptors, which are resolved to concrete staged locations during the run phase via resolveDependencies().

JAR Layout

The JAR-based staging produces the following directory structure:

META-INF/
  MANIFEST.MF                          (Main-Class, Manifest-Version)
BEAM-PIPELINE/
  pipeline-manifest.json               (defaultJobName)
  {jobName}/
    pipeline.json                      (RunnerApi.Pipeline as JSON)
    pipeline-options.json              (PipelineOptions as JSON Struct)
    artifacts/
      {uuid-1}                         (First artifact binary)
      {uuid-2}                         (Second artifact binary)
      ...
...Java classes from main JAR...
