
Principle:Apache Beam Job Submission Twister2

From Leeroopedia


Principle Name: Job Submission Twister2
Domain: Job_Management, HPC
Description: Process of submitting a translated Beam pipeline as a Twister2 batch job to a cluster and executing the TSet DAG on worker nodes
Deprecation Notice: The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0
Last Updated: 2026-02-09 04:00 GMT

Overview

Job Submission Twister2 describes the process of submitting a translated Beam pipeline as a Twister2 batch job to a cluster and executing the resulting TSet DAG on worker nodes. This is the final stage of the Twister2 pipeline execution lifecycle, following configuration, transform overrides, translation, and classpath packaging.

Note: The Twister2 Runner is deprecated and is scheduled for removal in Apache Beam 3.0. Users should plan migration to an actively maintained runner.

Description

Job submission in the Twister2 runner involves creating a job descriptor, serializing the translated pipeline, and submitting the job to either a local executor or a remote cluster. The process has two sides: the client side (job construction and submission) and the worker side (job deserialization and execution).

Client Side: Job Construction

After pipeline translation completes, the runner constructs a Twister2Job descriptor containing:

  • Job name -- options.getJobName(); the user-specified or auto-generated job name
  • Worker class -- BeamBatchWorker.class; the worker implementation that will execute on each node
  • Compute resources -- CPUs, RAM, worker count; resource requirements per worker, taken from options
  • Job config -- serialized pipeline data; contains the side inputs, leaf IDs, and the TSet graph
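The descriptor above can be sketched as a plain data object. This is a simplified stand-in for illustration only: the real Twister2Job uses a builder API, and the class and field names here are hypothetical, but the information carried matches the components listed.

```java
import java.util.HashMap;
import java.util.Map;

public class JobDescriptorSketch {
    // Stand-in for Twister2's job descriptor (the real class is built via a builder).
    static class Twister2Job {
        final String jobName;
        final String workerClass;
        final int cpus, ramMegaBytes, workers;
        final Map<String, byte[]> config; // stand-in for the JobConfig

        Twister2Job(String jobName, String workerClass, int cpus, int ramMegaBytes,
                    int workers, Map<String, byte[]> config) {
            this.jobName = jobName; this.workerClass = workerClass;
            this.cpus = cpus; this.ramMegaBytes = ramMegaBytes;
            this.workers = workers; this.config = config;
        }
    }

    // Assemble a descriptor from the job name, per-worker resources, and the
    // three serialized pipeline pieces (graph, side inputs, leaves).
    static Twister2Job buildJob(String jobName, int cpus, int ramMb, int parallelism,
                                byte[] graph, byte[] sideInputs, byte[] leaves) {
        Map<String, byte[]> config = new HashMap<>();
        config.put("graph", graph);           // serialized TBaseGraph
        config.put("sideInputs", sideInputs); // serialized view-name -> TSet-id map
        config.put("leaves", leaves);         // serialized leaf TSet ids (sinks)
        return new Twister2Job(jobName, "BeamBatchWorker", cpus, ramMb, parallelism, config);
    }

    public static void main(String[] args) {
        Twister2Job job = buildJob("beam-job", 2, 2048, 4,
                new byte[0], new byte[0], new byte[0]);
        System.out.println(job.jobName + " workers=" + job.workers); // beam-job workers=4
    }
}
```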

The JobConfig carries three critical pieces of serialized data:

  1. sideInputs -- A map of PCollectionView names to TSet node IDs for side inputs
  2. leaves -- A set of TSet node IDs representing pipeline outputs (sinks)
  3. graph -- The complete TBaseGraph containing the TSet DAG
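Each of these pieces must be Java-serializable before it can travel in the config. The round trip can be sketched with standard object serialization; the toy map, set, and string below stand in for the real TSet structures, and the helper names are illustrative, not the runner's actual API.

```java
import java.io.*;
import java.util.*;

public class JobConfigData {
    // Serialize a value into bytes, as the runner must do for config transport.
    static byte[] toBytes(Object value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(value);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Restore a value from bytes, as the worker does on the other side.
    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Toy stand-ins for the three serialized pieces:
        Map<String, String> sideInputs = new HashMap<>(Map.of("viewA", "tset-3"));
        Set<String> leaves = new HashSet<>(Set.of("tset-7", "tset-9"));
        String graph = "TBaseGraph{...}"; // placeholder for the real DAG object

        Map<String, byte[]> config = new HashMap<>();
        config.put("sideInputs", toBytes(sideInputs));
        config.put("leaves", toBytes(leaves));
        config.put("graph", toBytes(graph));

        Set<String> restored = fromBytes(config.get("leaves"));
        System.out.println(restored.contains("tset-7")); // true
    }
}
```

A non-serializable object anywhere in the graph fails at this step, which is why serialization errors surface at submission time rather than during translation.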

Job Submission

The job is submitted via one of two paths:

  • Local mode (when twister2Home is null/empty) -- Uses LocalSubmitter.submitJob() for single-worker execution. In this mode, the serialized data is passed via a config map (not the JobConfig), parallelism is forced to 1, and network buffer settings are configured.
  • Cluster mode -- Uses Twister2Submitter.submitJob() for multi-worker distributed execution. The serialized data is placed in the JobConfig, and the Twister2 resource allocator handles distribution.
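The fork between the two paths hinges on the twister2Home option. A minimal sketch of that decision, with illustrative names rather than the runner's actual code:

```java
public class SubmitterChoice {
    // An empty or null twister2Home selects the local single-worker path;
    // otherwise the job goes to the cluster submitter.
    static String chooseSubmitter(String twister2Home) {
        if (twister2Home == null || twister2Home.isEmpty()) {
            return "LocalSubmitter";   // parallelism forced to 1; data passed via a config map
        }
        return "Twister2Submitter";    // data placed in the JobConfig; resource allocator distributes
    }

    public static void main(String[] args) {
        System.out.println(chooseSubmitter(null));            // LocalSubmitter
        System.out.println(chooseSubmitter("/opt/twister2")); // Twister2Submitter
    }
}
```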

Worker Side: Job Execution

On each worker node, the Twister2 framework instantiates a BeamBatchWorker and calls its execute() method with a BatchTSetEnvironment. The worker:

  1. Deserializes the TSet graph, side input IDs, and leaf IDs from the config
  2. Restores graph state -- Sets the deserialized graph on the environment and resets all TSet/TLink environment references
  3. Resolves side inputs -- Locates side input TSets in the graph by ID and caches them
  4. Executes leaves -- For each leaf TSet, adds a sink and executes the sub-graph
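The worker-side flow above can be modeled in miniature. The TSet class and execute method below are simplified stand-ins for BeamBatchWorker's actual logic, kept only to show the order of operations: resolve side inputs by ID, then attach a sink to each leaf and run it.

```java
import java.util.*;

public class WorkerExecuteSketch {
    // Simplified stand-in for a TSet node in the deserialized graph.
    static class TSet { final String id; TSet(String id) { this.id = id; } }

    // Toy model of the worker's execute flow: cache side inputs, then
    // "execute" each leaf (modeled here as producing a sink label).
    static List<String> execute(Map<String, TSet> graph,
                                Set<String> sideInputIds, Set<String> leafIds) {
        // Resolve side-input TSets in the graph by ID and cache them.
        Map<String, TSet> sideInputCache = new HashMap<>();
        for (String id : sideInputIds) sideInputCache.put(id, graph.get(id));

        // For each leaf, add a sink and run that sub-graph (TreeSet for stable order).
        List<String> executed = new ArrayList<>();
        for (String id : new TreeSet<>(leafIds)) executed.add("sink(" + graph.get(id).id + ")");
        return executed;
    }

    public static void main(String[] args) {
        Map<String, TSet> graph = new HashMap<>();
        for (String id : List.of("src", "side-1", "leaf-1", "leaf-2")) graph.put(id, new TSet(id));
        System.out.println(execute(graph, Set.of("side-1"), Set.of("leaf-1", "leaf-2")));
        // prints [sink(leaf-1), sink(leaf-2)]
    }
}
```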

Result Collection

Twister2Submitter.submitJob() (or LocalSubmitter.submitJob() in local mode) returns a Twister2JobState, which the runner wraps in a Twister2PipelineResult. The job state maps to Beam's PipelineResult.State as follows:

  • RUNNING → State.RUNNING
  • COMPLETED → State.DONE
  • FAILED → State.FAILED
  • any other state → State.FAILED
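This mapping is a straightforward switch with a fail-closed default. A sketch with stand-in enums (the KILLED value is hypothetical, included only to exercise the default branch):

```java
public class StateMapping {
    enum Twister2State { RUNNING, COMPLETED, FAILED, KILLED }
    enum BeamState { RUNNING, DONE, FAILED }

    // Mirrors the table above: any unrecognized Twister2 state maps to FAILED.
    static BeamState toBeamState(Twister2State s) {
        switch (s) {
            case RUNNING:   return BeamState.RUNNING;
            case COMPLETED: return BeamState.DONE;
            case FAILED:    return BeamState.FAILED;
            default:        return BeamState.FAILED; // fail-closed default
        }
    }

    public static void main(String[] args) {
        System.out.println(toBeamState(Twister2State.COMPLETED)); // DONE
    }
}
```

Defaulting unknown states to FAILED means the client never mistakes an unrecognized terminal state for success.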

Usage

Job submission is triggered automatically by Twister2Runner.run(). Users interact with this process through:

  • Configuring resources -- options.setWorkerCPUs(), options.setRamMegaBytes(), options.setParallelism()
  • Choosing cluster type -- options.setClusterType("standalone" | "nomad" | "kubernetes" | "mesos")
  • Checking results -- The returned PipelineResult provides the final state

Understanding job submission helps when:

  • Debugging submission failures -- Network issues, resource constraints, or Twister2 configuration problems
  • Troubleshooting serialization errors -- The TSet graph, side inputs, and leaves must be serializable
  • Understanding worker execution -- The worker-side deserialization and execution flow determines how the pipeline actually runs

Theoretical Basis

Job submission in Twister2 is based on the master-worker pattern in HPC computing:

  • Client as coordinator -- The client (master) constructs a complete job specification including code, data, and resource requirements. It does not participate in the actual computation.
  • Workers as executors -- Workers independently execute the computation on their assigned partition. Each worker receives the same serialized graph and independently resolves its portion of the work.
  • Serialization as communication -- The entire pipeline (TSet graph, side inputs, configuration) is serialized into the JobConfig. This is the sole communication channel between the client and workers, following the message-passing paradigm of distributed systems.
  • Resource declaration -- Workers declare their resource needs (CPUs, RAM) upfront, enabling the cluster scheduler to make allocation decisions. This follows the declarative resource management pattern common in HPC and container orchestration systems.
  • State machine for job lifecycle -- The job transitions through states (submitted, running, completed/failed), and the result wraps this state for the client. This follows the finite state machine pattern for lifecycle management.
