Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Apache Beam Job Preparation

From Leeroopedia


Property Value
Principle Name Job Preparation
Category Pipeline_Orchestration, Job_Management
Related Implementation Implementation:Apache_Beam_InMemoryJobService_Prepare
Source Reference runners/java-job-service/.../InMemoryJobService.java L168-216
last_updated 2026-02-09 04:00 GMT

Overview

Job Preparation is the process of registering a pipeline with the job service and obtaining artifact staging credentials before execution. It is the first phase of the two-phase submission protocol in the Beam Portability Framework.

Description

Job preparation is the handshake between the SDK client and job service. The client sends the pipeline proto and options to the prepare endpoint. The server validates the submission, creates a unique preparation ID, sets up an artifact staging endpoint, and returns credentials for uploading pipeline dependencies.

This separation of preparation from execution allows the client to stage artifacts before starting the job, ensuring that all dependencies are available on the runner when execution begins.

The preparation flow involves the following steps:

  1. Client constructs a PrepareJobRequest -- The request contains the job name, the translated RunnerApi.Pipeline protobuf, and pipeline options serialized as a Struct.
  2. Server generates a preparation ID -- The ID is a unique identifier combining the job name with a UUID (e.g., myJob_a1b2c3d4-...).
  3. Server stores the preparation -- A JobPreparation record is stored in a concurrent map, indexed by the preparation ID.
  4. Server creates a staging token -- The staging service token provider generates a token that authorizes the client to upload artifacts.
  5. Server registers expected artifacts -- The pipeline's environment dependencies are extracted and registered with the artifact staging service.
  6. Server returns PrepareJobResponse -- The response contains the preparation ID, the artifact staging endpoint, and the staging session token.
// Client-side (PortableRunner.run())
PrepareJobRequest prepareJobRequest =
    PrepareJobRequest.newBuilder()
        .setJobName(options.getJobName())
        .setPipeline(pipelineProto)
        .setPipelineOptions(PipelineOptionsTranslation.toProto(options))
        .build();

PrepareJobResponse prepareJobResponse =
    jobService.withDeadlineAfter(jobServerTimeout, TimeUnit.SECONDS)
        .withWaitForReady()
        .prepare(prepareJobRequest);

Preparation Data Model

Field Type Description
preparationId String Unique ID for this preparation, format: {jobName}_{UUID}
pipeline RunnerApi.Pipeline The translated pipeline protobuf
options Struct Pipeline options serialized as a protobuf Struct
artifactStagingEndpoint ApiServiceDescriptor gRPC endpoint for uploading artifacts
stagingSessionToken String Authorization token for artifact staging

Usage

Job preparation happens automatically as part of portable pipeline submission. Understanding preparation helps diagnose pipeline validation errors and staging failures.

Common Preparation Failures

Failure Mode Symptom Root Cause
Null pipeline options NullPointerException during prepare Options not properly serialized
Duplicate preparation ID NOT_FOUND status error UUID collision (extremely rare)
Staging service unavailable Connection refused on staging endpoint Staging service not started
Timeout on prepare RPC DEADLINE_EXCEEDED status Job service overloaded or unreachable

Preparation vs. Execution

The two-phase protocol (prepare then run) provides several advantages:

  • Validation before commitment -- The server can validate the pipeline proto before any resources are allocated for execution.
  • Artifact staging window -- The client has time to upload potentially large artifacts between preparation and execution.
  • Idempotent preparation -- If artifact staging fails, the client can re-prepare without side effects from a partially started job.
  • Resource estimation -- The server can inspect the pipeline during preparation to estimate resource requirements before execution.

Theoretical Basis

Job Preparation is based on the two-phase submission pattern: prepare (register + stage) then run (execute). This pattern appears widely in distributed systems:

  • Two-Phase Commit (2PC) -- While not identical to the classic 2PC protocol, the prepare-then-run pattern shares the concept of a preparation phase that can be rolled back before the commit (execution) phase.
  • Resource Reservation -- The preparation phase acts as a reservation, claiming a unique preparation ID and setting up staging infrastructure before the actual execution begins.
  • Capability-Based Security -- The staging session token returned during preparation acts as a capability token, granting the holder (the SDK client) authorization to upload artifacts to the staging service for this specific preparation.

The separation of concerns between preparation and execution also enables pipelining at the protocol level: the client can begin staging artifacts immediately after receiving the preparation response, overlapping artifact upload with any server-side preparation work.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment