Principle:Apache Beam Job Preparation
| Property | Value |
|---|---|
| Principle Name | Job Preparation |
| Category | Pipeline_Orchestration, Job_Management |
| Related Implementation | Implementation:Apache_Beam_InMemoryJobService_Prepare |
| Source Reference | runners/java-job-service/.../InMemoryJobService.java L168-216 |
| last_updated | 2026-02-09 04:00 GMT |
Overview
Job Preparation is the process of registering a pipeline with the job service and obtaining artifact staging credentials before execution. It is the first phase of the two-phase submission protocol in the Beam Portability Framework.
Description
Job preparation is the handshake between the SDK client and job service. The client sends the pipeline proto and options to the prepare endpoint. The server validates the submission, creates a unique preparation ID, sets up an artifact staging endpoint, and returns credentials for uploading pipeline dependencies.
This separation of preparation from execution allows the client to stage artifacts before starting the job, ensuring that all dependencies are available on the runner when execution begins.
The preparation flow involves the following steps:
- Client constructs a PrepareJobRequest -- The request contains the job name, the translated
RunnerApi.Pipelineprotobuf, and pipeline options serialized as aStruct. - Server generates a preparation ID -- The ID is a unique identifier combining the job name with a UUID (e.g.,
myJob_a1b2c3d4-...). - Server stores the preparation -- A
JobPreparationrecord is stored in a concurrent map, indexed by the preparation ID. - Server creates a staging token -- The staging service token provider generates a token that authorizes the client to upload artifacts.
- Server registers expected artifacts -- The pipeline's environment dependencies are extracted and registered with the artifact staging service.
- Server returns PrepareJobResponse -- The response contains the preparation ID, the artifact staging endpoint, and the staging session token.
// Client-side (PortableRunner.run())
PrepareJobRequest prepareJobRequest =
PrepareJobRequest.newBuilder()
.setJobName(options.getJobName())
.setPipeline(pipelineProto)
.setPipelineOptions(PipelineOptionsTranslation.toProto(options))
.build();
PrepareJobResponse prepareJobResponse =
jobService.withDeadlineAfter(jobServerTimeout, TimeUnit.SECONDS)
.withWaitForReady()
.prepare(prepareJobRequest);
Preparation Data Model
| Field | Type | Description |
|---|---|---|
preparationId |
String |
Unique ID for this preparation, format: {jobName}_{UUID}
|
pipeline |
RunnerApi.Pipeline |
The translated pipeline protobuf |
options |
Struct |
Pipeline options serialized as a protobuf Struct |
artifactStagingEndpoint |
ApiServiceDescriptor |
gRPC endpoint for uploading artifacts |
stagingSessionToken |
String |
Authorization token for artifact staging |
Usage
Job preparation happens automatically as part of portable pipeline submission. Understanding preparation helps diagnose pipeline validation errors and staging failures.
Common Preparation Failures
| Failure Mode | Symptom | Root Cause |
|---|---|---|
| Null pipeline options | NullPointerException during prepare |
Options not properly serialized |
| Duplicate preparation ID | NOT_FOUND status error |
UUID collision (extremely rare) |
| Staging service unavailable | Connection refused on staging endpoint | Staging service not started |
| Timeout on prepare RPC | DEADLINE_EXCEEDED status |
Job service overloaded or unreachable |
Preparation vs. Execution
The two-phase protocol (prepare then run) provides several advantages:
- Validation before commitment -- The server can validate the pipeline proto before any resources are allocated for execution.
- Artifact staging window -- The client has time to upload potentially large artifacts between preparation and execution.
- Idempotent preparation -- If artifact staging fails, the client can re-prepare without side effects from a partially started job.
- Resource estimation -- The server can inspect the pipeline during preparation to estimate resource requirements before execution.
Theoretical Basis
Job Preparation is based on the two-phase submission pattern: prepare (register + stage) then run (execute). This pattern appears widely in distributed systems:
- Two-Phase Commit (2PC) -- While not identical to the classic 2PC protocol, the prepare-then-run pattern shares the concept of a preparation phase that can be rolled back before the commit (execution) phase.
- Resource Reservation -- The preparation phase acts as a reservation, claiming a unique preparation ID and setting up staging infrastructure before the actual execution begins.
- Capability-Based Security -- The staging session token returned during preparation acts as a capability token, granting the holder (the SDK client) authorization to upload artifacts to the staging service for this specific preparation.
The separation of concerns between preparation and execution also enables pipelining at the protocol level: the client can begin staging artifacts immediately after receiving the preparation response, overlapping artifact upload with any server-side preparation work.
Related Pages
- Implementation:Apache_Beam_InMemoryJobService_Prepare -- The concrete server-side implementation of the prepare RPC
- Principle:Apache_Beam_Pipeline_Translation -- Translation must complete before preparation can begin
- Principle:Apache_Beam_Artifact_Staging -- Artifact staging occurs between preparation and execution
- Principle:Apache_Beam_Job_Execution -- Execution begins after preparation and staging are complete