
Principle:Apache Beam Job Submission Twister2

From Leeroopedia


Principle Name: Job Submission Twister2
Domain: Job_Management, HPC
Description: Process of submitting a translated Beam pipeline as a Twister2 batch job to a cluster and executing the TSet DAG on worker nodes
Deprecation Notice: The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0
Last Updated: 2026-02-09 04:00 GMT

Overview

Job Submission Twister2 describes the process of submitting a translated Beam pipeline as a Twister2 batch job to a cluster and executing the resulting TSet DAG on worker nodes. This is the final stage of the Twister2 pipeline execution lifecycle, following configuration, transform overrides, translation, and classpath packaging.

Note: The Twister2 Runner is deprecated and is scheduled for removal in Apache Beam 3.0. Users should plan migration to an actively maintained runner.

Description

Job submission in the Twister2 runner involves creating a job descriptor, serializing the translated pipeline, and submitting the job to either a local executor or a remote cluster. The process has two sides: the client side (job construction and submission) and the worker side (job deserialization and execution).

Client Side: Job Construction

After pipeline translation completes, the runner constructs a Twister2Job descriptor containing:

  • Job name -- options.getJobName(); the user-specified or auto-generated job name
  • Worker class -- BeamBatchWorker.class; the worker implementation that will execute on each node
  • Compute resources -- CPUs, RAM, worker count; resource requirements per worker, taken from options
  • Job config -- serialized pipeline data; contains the side inputs, leaf IDs, and the TSet graph
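The descriptor above can be sketched as a plain data object. This is a simplified stand-in for illustration only: the real Twister2Job uses a builder API, and the class and field names here are hypothetical, but the information carried matches the components listed.

```java
import java.util.HashMap;
import java.util.Map;

public class JobDescriptorSketch {
    // Stand-in for Twister2's job descriptor (the real class is built via a builder).
    static class Twister2Job {
        final String jobName;
        final String workerClass;
        final int cpus, ramMegaBytes, workers;
        final Map<String, byte[]> config; // stand-in for the JobConfig

        Twister2Job(String jobName, String workerClass, int cpus, int ramMegaBytes,
                    int workers, Map<String, byte[]> config) {
            this.jobName = jobName; this.workerClass = workerClass;
            this.cpus = cpus; this.ramMegaBytes = ramMegaBytes;
            this.workers = workers; this.config = config;
        }
    }

    // Assemble a descriptor from the job name, per-worker resources, and the
    // three serialized pipeline pieces (graph, side inputs, leaves).
    static Twister2Job buildJob(String jobName, int cpus, int ramMb, int parallelism,
                                byte[] graph, byte[] sideInputs, byte[] leaves) {
        Map<String, byte[]> config = new HashMap<>();
        config.put("graph", graph);           // serialized TBaseGraph
        config.put("sideInputs", sideInputs); // serialized view-name -> TSet-id map
        config.put("leaves", leaves);         // serialized leaf TSet ids (sinks)
        return new Twister2Job(jobName, "BeamBatchWorker", cpus, ramMb, parallelism, config);
    }

    public static void main(String[] args) {
        Twister2Job job = buildJob("beam-job", 2, 2048, 4,
                new byte[0], new byte[0], new byte[0]);
        System.out.println(job.jobName + " workers=" + job.workers); // beam-job workers=4
    }
}
```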

The JobConfig carries three critical pieces of serialized data:

  1. sideInputs -- A map of PCollectionView names to TSet node IDs for side inputs
  2. leaves -- A set of TSet node IDs representing pipeline outputs (sinks)
  3. graph -- The complete TBaseGraph containing the TSet DAG
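Each of these pieces must be Java-serializable before it can travel in the config. The round trip can be sketched with standard object serialization; the toy map, set, and string below stand in for the real TSet structures, and the helper names are illustrative, not the runner's actual API.

```java
import java.io.*;
import java.util.*;

public class JobConfigData {
    // Serialize a value into bytes, as the runner must do for config transport.
    static byte[] toBytes(Object value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(value);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Restore a value from bytes, as the worker does on the other side.
    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (T) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Toy stand-ins for the three serialized pieces:
        Map<String, String> sideInputs = new HashMap<>(Map.of("viewA", "tset-3"));
        Set<String> leaves = new HashSet<>(Set.of("tset-7", "tset-9"));
        String graph = "TBaseGraph{...}"; // placeholder for the real DAG object

        Map<String, byte[]> config = new HashMap<>();
        config.put("sideInputs", toBytes(sideInputs));
        config.put("leaves", toBytes(leaves));
        config.put("graph", toBytes(graph));

        Set<String> restored = fromBytes(config.get("leaves"));
        System.out.println(restored.contains("tset-7")); // true
    }
}
```

A non-serializable object anywhere in the graph fails at this step, which is why serialization errors surface at submission time rather than during translation.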

Job Submission

The job is submitted via one of two paths:

  • Local mode (when twister2Home is null/empty) -- Uses LocalSubmitter.submitJob() for single-worker execution. In this mode, the serialized data is passed via a config map (not the JobConfig), parallelism is forced to 1, and network buffer settings are configured.
  • Cluster mode -- Uses Twister2Submitter.submitJob() for multi-worker distributed execution. The serialized data is placed in the JobConfig, and the Twister2 resource allocator handles distribution.
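The fork between the two paths hinges on the twister2Home option. A minimal sketch of that decision, with illustrative names rather than the runner's actual code:

```java
public class SubmitterChoice {
    // An empty or null twister2Home selects the local single-worker path;
    // otherwise the job goes to the cluster submitter.
    static String chooseSubmitter(String twister2Home) {
        if (twister2Home == null || twister2Home.isEmpty()) {
            return "LocalSubmitter";   // parallelism forced to 1; data passed via a config map
        }
        return "Twister2Submitter";    // data placed in the JobConfig; resource allocator distributes
    }

    public static void main(String[] args) {
        System.out.println(chooseSubmitter(null));            // LocalSubmitter
        System.out.println(chooseSubmitter("/opt/twister2")); // Twister2Submitter
    }
}
```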

Worker Side: Job Execution

On each worker node, the Twister2 framework instantiates a BeamBatchWorker and calls its execute() method with a BatchTSetEnvironment. The worker:

  1. Deserializes the TSet graph, side input IDs, and leaf IDs from the config
  2. Restores graph state -- Sets the deserialized graph on the environment and resets all TSet/TLink environment references
  3. Resolves side inputs -- Locates side input TSets in the graph by ID and caches them
  4. Executes leaves -- For each leaf TSet, adds a sink and executes the sub-graph
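The worker-side flow above can be modeled in miniature. The TSet class and execute method below are simplified stand-ins for BeamBatchWorker's actual logic, kept only to show the order of operations: resolve side inputs by ID, then attach a sink to each leaf and run it.

```java
import java.util.*;

public class WorkerExecuteSketch {
    // Simplified stand-in for a TSet node in the deserialized graph.
    static class TSet { final String id; TSet(String id) { this.id = id; } }

    // Toy model of the worker's execute flow: cache side inputs, then
    // "execute" each leaf (modeled here as producing a sink label).
    static List<String> execute(Map<String, TSet> graph,
                                Set<String> sideInputIds, Set<String> leafIds) {
        // Resolve side-input TSets in the graph by ID and cache them.
        Map<String, TSet> sideInputCache = new HashMap<>();
        for (String id : sideInputIds) sideInputCache.put(id, graph.get(id));

        // For each leaf, add a sink and run that sub-graph (TreeSet for stable order).
        List<String> executed = new ArrayList<>();
        for (String id : new TreeSet<>(leafIds)) executed.add("sink(" + graph.get(id).id + ")");
        return executed;
    }

    public static void main(String[] args) {
        Map<String, TSet> graph = new HashMap<>();
        for (String id : List.of("src", "side-1", "leaf-1", "leaf-2")) graph.put(id, new TSet(id));
        System.out.println(execute(graph, Set.of("side-1"), Set.of("leaf-1", "leaf-2")));
        // prints [sink(leaf-1), sink(leaf-2)]
    }
}
```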

Result Collection

Twister2Submitter.submitJob() (or LocalSubmitter.submitJob() in local mode) returns a Twister2JobState, which the runner wraps in a Twister2PipelineResult. The job state maps to Beam's PipelineResult.State as follows:

  • RUNNING → State.RUNNING
  • COMPLETED → State.DONE
  • FAILED → State.FAILED
  • any other state → State.FAILED
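This mapping is a straightforward switch with a fail-closed default. A sketch with stand-in enums (the KILLED value is hypothetical, included only to exercise the default branch):

```java
public class StateMapping {
    enum Twister2State { RUNNING, COMPLETED, FAILED, KILLED }
    enum BeamState { RUNNING, DONE, FAILED }

    // Mirrors the table above: any unrecognized Twister2 state maps to FAILED.
    static BeamState toBeamState(Twister2State s) {
        switch (s) {
            case RUNNING:   return BeamState.RUNNING;
            case COMPLETED: return BeamState.DONE;
            case FAILED:    return BeamState.FAILED;
            default:        return BeamState.FAILED; // fail-closed default
        }
    }

    public static void main(String[] args) {
        System.out.println(toBeamState(Twister2State.COMPLETED)); // DONE
    }
}
```

Defaulting unknown states to FAILED means the client never mistakes an unrecognized terminal state for success.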

Usage

Job submission is triggered automatically by Twister2Runner.run(). Users interact with this process through:

  • Configuring resources -- options.setWorkerCPUs(), options.setRamMegaBytes(), options.setParallelism()
  • Choosing cluster type -- options.setClusterType("standalone" | "nomad" | "kubernetes" | "mesos")
  • Checking results -- The returned PipelineResult provides the final state

Understanding job submission helps when:

  • Debugging submission failures -- Network issues, resource constraints, or Twister2 configuration problems
  • Troubleshooting serialization errors -- The TSet graph, side inputs, and leaves must be serializable
  • Understanding worker execution -- The worker-side deserialization and execution flow determines how the pipeline actually runs

Theoretical Basis

Job submission in Twister2 is based on the master-worker pattern in HPC computing:

  • Client as coordinator -- The client (master) constructs a complete job specification including code, data, and resource requirements. It does not participate in the actual computation.
  • Workers as executors -- Workers independently execute the computation on their assigned partition. Each worker receives the same serialized graph and independently resolves its portion of the work.
  • Serialization as communication -- The entire pipeline (TSet graph, side inputs, configuration) is serialized into the JobConfig. This is the sole communication channel between the client and workers, following the message-passing paradigm of distributed systems.
  • Resource declaration -- Workers declare their resource needs (CPUs, RAM) upfront, enabling the cluster scheduler to make allocation decisions. This follows the declarative resource management pattern common in HPC and container orchestration systems.
  • State machine for job lifecycle -- The job transitions through states (submitted, running, completed/failed), and the result wraps this state for the client. This follows the finite state machine pattern for lifecycle management.
