Principle:Apache Beam Job Submission Twister2
| Attribute | Value |
|---|---|
| Principle Name | Job Submission Twister2 |
| Domain | Job_Management, HPC |
| Description | Process of submitting a translated Beam pipeline as a Twister2 batch job to a cluster and executing the TSet DAG on worker nodes |
| Deprecation Notice | The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0 |
| last_updated | 2026-02-09 04:00 GMT |
Overview
Job Submission Twister2 describes the process of submitting a translated Beam pipeline as a Twister2 batch job to a cluster and executing the resulting TSet DAG on worker nodes. This is the final stage of the Twister2 pipeline execution lifecycle, following configuration, transform overrides, translation, and classpath packaging.
Note: The Twister2 Runner is deprecated and is scheduled for removal in Apache Beam 3.0. Users should plan migration to an actively maintained runner.
Description
Job submission in the Twister2 runner involves creating a job descriptor, serializing the translated pipeline, and submitting the job to either a local executor or a remote cluster. The process has two sides: the client side (job construction and submission) and the worker side (job deserialization and execution).
Client Side: Job Construction
After pipeline translation completes, the runner constructs a Twister2Job descriptor containing:
| Component | Value | Description |
|---|---|---|
| Job name | options.getJobName() | User-specified or auto-generated job name |
| Worker class | BeamBatchWorker.class | The worker implementation that will execute on each node |
| Compute resources | CPUs, RAM, worker count | Resource requirements per worker from options |
| Job config | Serialized pipeline data | Contains side inputs, leaf IDs, and the TSet graph |
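The shape of this descriptor can be sketched with a minimal stand-in (a simplification with hypothetical class and field names; the real runner assembles a Twister2Job carrying these same four components):

```java
// Hypothetical stand-in for the job descriptor built on the client side.
import java.util.HashMap;
import java.util.Map;

public class JobDescriptor {
    public final String jobName;                // from options.getJobName()
    public final String workerClass;            // BeamBatchWorker in the real runner
    public final double cpusPerWorker;          // compute resources per worker
    public final int ramMegaBytesPerWorker;
    public final int workerCount;
    public final Map<String, byte[]> jobConfig; // serialized pipeline data

    public JobDescriptor(String jobName, String workerClass, double cpusPerWorker,
                         int ramMegaBytesPerWorker, int workerCount,
                         Map<String, byte[]> jobConfig) {
        this.jobName = jobName;
        this.workerClass = workerClass;
        this.cpusPerWorker = cpusPerWorker;
        this.ramMegaBytesPerWorker = ramMegaBytesPerWorker;
        this.workerCount = workerCount;
        this.jobConfig = jobConfig;
    }

    public static void main(String[] args) {
        JobDescriptor job = new JobDescriptor(
                "beam-wordcount", "BeamBatchWorker", 2.0, 2048, 4, new HashMap<>());
        System.out.println(job.jobName + ": " + job.workerCount + " workers");
    }
}
```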
The JobConfig carries three critical pieces of serialized data:
- sideInputs -- A map of PCollectionView names to TSet node IDs for side inputs
- leaves -- A set of TSet node IDs representing pipeline outputs (sinks)
- graph -- The complete TBaseGraph containing the TSet DAG
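Because all three entries travel as serialized bytes, everything reachable from them must be Java-serializable. A minimal sketch of packing such entries (the key names and TSet IDs here are illustrative, not the runner's actual constants):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class ConfigSerialization {
    // Serialize any Serializable value to bytes, as the runner must do for
    // the sideInputs map, the leaves set, and the TSet graph.
    public static byte[] toBytes(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> jobConfig = new HashMap<>();
        // sideInputs: PCollectionView name -> TSet node ID (illustrative values)
        HashMap<String, String> sideInputs = new HashMap<>();
        sideInputs.put("view-1", "tset-7");
        // leaves: TSet node IDs of pipeline sinks
        HashSet<String> leaves = new HashSet<>(Arrays.asList("tset-9", "tset-12"));
        jobConfig.put("sideInputs", toBytes(sideInputs));
        jobConfig.put("leaves", toBytes(leaves));
        // the real runner also serializes the TBaseGraph under a third key
        System.out.println(jobConfig.keySet());
    }
}
```

A NotSerializableException at this point is the usual source of the serialization errors mentioned under Usage below.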
Job Submission
The job is submitted via one of two paths:
- Local mode (when twister2Home is null or empty) -- Uses LocalSubmitter.submitJob() for single-worker execution. In this mode, the serialized data is passed via a config map (not the JobConfig), parallelism is forced to 1, and network buffer settings are configured.
- Cluster mode -- Uses Twister2Submitter.submitJob() for multi-worker distributed execution. The serialized data is placed in the JobConfig, and the Twister2 resource allocator handles distribution.
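The branch between the two paths can be sketched as follows (a simplification: choosePath is a hypothetical helper that only mirrors the decision, not the actual submission):

```java
public class SubmitPath {
    // Mirrors the runner's branch: a null or empty twister2Home means local,
    // single-worker execution; otherwise the job goes to the cluster submitter.
    public static String choosePath(String twister2Home) {
        if (twister2Home == null || twister2Home.isEmpty()) {
            return "LocalSubmitter.submitJob";   // parallelism forced to 1
        }
        return "Twister2Submitter.submitJob";    // JobConfig carries the data
    }

    public static void main(String[] args) {
        System.out.println(choosePath(null));            // local path
        System.out.println(choosePath("/opt/twister2")); // cluster path
    }
}
```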
Worker Side: Job Execution
On each worker node, the Twister2 framework instantiates a BeamBatchWorker and calls its execute() method with a BatchTSetEnvironment. The worker:
- Deserializes the TSet graph, side input IDs, and leaf IDs from the config
- Restores graph state -- Sets the deserialized graph on the environment and resets all TSet/TLink environment references
- Resolves side inputs -- Locates side input TSets in the graph by ID and caches them
- Executes leaves -- For each leaf TSet, adds a sink and executes the sub-graph
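The worker-side flow can be sketched as a round trip over Java serialization (a simplification with illustrative IDs; the real worker reads these entries from the Twister2 config and operates on TSet objects rather than strings):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WorkerDecode {
    // Round-trip helpers standing in for the worker's config decoding.
    public static byte[] toBytes(Serializable v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(v);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T> T fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (T) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // What the client side packed (illustrative IDs):
        Map<String, byte[]> config = new HashMap<>();
        config.put("sideInputs", toBytes(new HashMap<>(Map.of("view-1", "tset-7"))));
        config.put("leaves", toBytes(new HashSet<>(Set.of("tset-9", "tset-12"))));

        // Worker side, in order: deserialize, resolve side inputs, execute leaves.
        Map<String, String> sideInputs = fromBytes(config.get("sideInputs"));
        Set<String> leaves = fromBytes(config.get("leaves"));
        System.out.println("side inputs resolved: " + sideInputs.keySet());
        for (String leaf : leaves) {
            // Real worker: locate the TSet by ID, add a sink, run the sub-graph.
            System.out.println("executing sub-graph rooted at " + leaf);
        }
    }
}
```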
Result Collection
Twister2Submitter.submitJob() (or LocalSubmitter.submitJob()) returns a Twister2JobState, which is wrapped in a Twister2PipelineResult. The job state is mapped to Beam's PipelineResult.State:
| Twister2 State | Beam State |
|---|---|
| RUNNING | State.RUNNING |
| COMPLETED | State.DONE |
| FAILED | State.FAILED |
| default | State.FAILED |
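The mapping in the table reduces to a switch with a FAILED fallback. A self-contained sketch using stand-in enums (the real code maps the Twister2JobState to Beam's PipelineResult.State; KILLED below is only an example of a state caught by the default case):

```java
public class StateMapping {
    public enum Twister2State { RUNNING, COMPLETED, FAILED, KILLED }
    public enum BeamState { RUNNING, DONE, FAILED }

    // Mapping from the table above; anything unrecognized falls through to FAILED.
    public static BeamState toBeam(Twister2State s) {
        switch (s) {
            case RUNNING:   return BeamState.RUNNING;
            case COMPLETED: return BeamState.DONE;
            case FAILED:    return BeamState.FAILED;
            default:        return BeamState.FAILED;
        }
    }
}
```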
Usage
Job submission is triggered automatically by Twister2Runner.run(). Users interact with this process through:
- Configuring resources -- options.setWorkerCPUs(), options.setRamMegaBytes(), options.setParallelism()
- Choosing cluster type -- options.setClusterType("standalone" | "nomad" | "kubernetes" | "mesos")
- Checking results -- The returned PipelineResult provides the final state
Understanding job submission helps when:
- Debugging submission failures -- Network issues, resource constraints, or Twister2 configuration problems
- Troubleshooting serialization errors -- The TSet graph, side inputs, and leaves must be serializable
- Understanding worker execution -- The worker-side deserialization and execution flow determines how the pipeline actually runs
Theoretical Basis
Job submission in Twister2 is based on the master-worker pattern in HPC computing:
- Client as coordinator -- The client (master) constructs a complete job specification including code, data, and resource requirements. It does not participate in the actual computation.
- Workers as executors -- Workers independently execute the computation on their assigned partition. Each worker receives the same serialized graph and independently resolves its portion of the work.
- Serialization as communication -- The entire pipeline (TSet graph, side inputs, configuration) is serialized into the
JobConfig. This is the sole communication channel between the client and workers, following the message-passing paradigm of distributed systems.
- Resource declaration -- Workers declare their resource needs (CPUs, RAM) upfront, enabling the cluster scheduler to make allocation decisions. This follows the declarative resource management pattern common in HPC and container orchestration systems.
- State machine for job lifecycle -- The job transitions through states (submitted, running, completed/failed), and the result wraps this state for the client. This follows the finite state machine pattern for lifecycle management.
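The lifecycle state machine described above can be made concrete with a small transition table (states simplified to the four mentioned; real Twister2 job states are richer):

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class JobLifecycle {
    public enum State { SUBMITTED, RUNNING, COMPLETED, FAILED }

    // Allowed transitions of the finite state machine: terminal states
    // (COMPLETED, FAILED) have no outgoing edges.
    private static final Map<State, Set<State>> TRANSITIONS = new HashMap<>();
    static {
        TRANSITIONS.put(State.SUBMITTED, EnumSet.of(State.RUNNING, State.FAILED));
        TRANSITIONS.put(State.RUNNING, EnumSet.of(State.COMPLETED, State.FAILED));
        TRANSITIONS.put(State.COMPLETED, EnumSet.noneOf(State.class));
        TRANSITIONS.put(State.FAILED, EnumSet.noneOf(State.class));
    }

    public static boolean canTransition(State from, State to) {
        return TRANSITIONS.get(from).contains(to);
    }
}
```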
Related Pages
- Implementation:Apache_Beam_BeamBatchWorker_Execute -- Worker-side execution of the submitted job
- Principle:Apache_Beam_Classpath_Packaging -- Packaging step that precedes job submission
- Principle:Apache_Beam_Pipeline_Translation_Twister2 -- Translation step that produces the TSet DAG submitted with the job
- Principle:Apache_Beam_Twister2_Execution_and_Result_Collection -- Execution details on the worker side