Workflow:Apache Beam Twister2 Batch Execution
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Batch_Processing, HPC |
| Last Updated | 2026-02-09 04:30 GMT |
Overview
End-to-end process for executing an Apache Beam batch pipeline on Twister2, an HPC-oriented big data framework, covering pipeline translation to TSet DAGs, classpath packaging, and job submission.
Description
This workflow describes how the Twister2 Runner translates a Beam pipeline into Twister2's TSet (typed set) execution model for batch processing. The Twister2Runner applies transform overrides, invokes the batch pipeline translator to convert Beam transforms into TSet operations, packages the classpath into a deployable archive, and submits the job either locally or to a Twister2 cluster. Dedicated batch translators cover the core Beam batch primitives: ParDo, GroupByKey, Flatten, Window, and Read.
Note: The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0.
Usage
Use this workflow to run a Beam batch pipeline on a Twister2 cluster, particularly for HPC workloads that benefit from Twister2's efficient inter-process communication and resource management. The runner handles batch pipelines only and is best suited to infrastructure where Twister2 is already deployed.
Execution Steps
Step 1: Pipeline Configuration
Configure the pipeline with Twister2-specific options via Twister2PipelineOptions. These options set the worker count, the cluster type (standalone, Mesos, Kubernetes, Nomad, or Slurm), the Twister2 home directory, and the per-worker resource allocations (CPU, RAM, disk). The Twister2Runner is registered via AutoService, so Beam can discover it and it can be selected as the pipeline runner.
Key considerations:
- Twister2PipelineOptions extends PipelineOptions with cluster-specific settings
- Twister2RunnerRegistrar provides automatic discovery via ServiceLoader
- Default configuration targets standalone local execution
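The configuration step can be pictured as a typed bag of settings filled from `--name=value` arguments on top of local defaults. The following is a minimal JDK-only sketch of that idea and does not use Beam's option classes; `RunnerOptionsSketch`, the flag names (`workers`, `clusterType`) and the default values are hypothetical stand-ins, not the actual Twister2PipelineOptions properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a runner options object; not the Beam API. */
final class RunnerOptionsSketch {
  private final Map<String, String> values = new LinkedHashMap<>();

  private RunnerOptionsSketch() {
    // Assumed defaults that mirror "standalone local execution".
    values.put("clusterType", "standalone");
    values.put("workers", "1");
  }

  /** Parses arguments of the form --name=value and rejects anything else. */
  static RunnerOptionsSketch fromArgs(String... args) {
    RunnerOptionsSketch options = new RunnerOptionsSketch();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (!arg.startsWith("--") || eq < 3) {
        throw new IllegalArgumentException("Expected --name=value but got: " + arg);
      }
      // Later arguments override earlier ones and the defaults.
      options.values.put(arg.substring(2, eq), arg.substring(eq + 1));
    }
    return options;
  }

  String get(String name) {
    return values.get(name);
  }

  int workers() {
    return Integer.parseInt(values.get("workers"));
  }

  public static void main(String[] args) {
    RunnerOptionsSketch options = fromArgs("--workers=4", "--clusterType=kubernetes");
    System.out.println(options.get("clusterType") + " with " + options.workers() + " workers");
  }
}
```

Keeping defaults inside the options object means a pipeline launched with no arguments still resolves to a local standalone setup, which matches the default described above.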
Step 2: Transform Override Application
The Twister2Runner applies transform overrides to decompose high-level transforms into primitives the Twister2 translator can handle. SplittableDoFn transforms are decomposed into their constituent processing and restriction tracking components. The overrides ensure compatibility with the batch translation layer.
Key considerations:
- SplittableParDo overrides are applied for splittable DoFn support
- The same override mechanism as the Direct Runner is reused
- Only batch-mode overrides are applied (streaming is not supported)
Step 3: Pipeline Translation
The Twister2PipelineExecutionEnvironment orchestrates the translation of the Beam pipeline into a Twister2 TSet DAG. A Twister2BatchPipelineTranslator visits each transform in the pipeline and delegates to a corresponding BatchTransformTranslator. Each translator converts the Beam transform into equivalent TSet operations, maintaining the data flow through Twister2's type-safe distributed dataset abstraction.
Key considerations:
- ParDoMultiOutputTranslatorBatch handles ParDo with multiple outputs via DoFnFunction
- GroupByKeyTranslatorBatch implements key-based grouping with window-aware reduce via GroupByWindowFunction
- ReadSourceTranslatorBatch wraps Beam BoundedSources as Twister2BoundedSource TSets
- FlattenTranslatorBatch unions multiple input TSets
- Serialization between stages uses ElemToBytesFunction and ByteToElemFunction
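The visit-and-delegate pattern described in this step can be sketched as a lookup table from transform kind to translator. This is a toy model under stated assumptions: `TranslationSketch`, `Node`, and `Translator` are hypothetical types, and the emitted operation labels (`source`, `compute`, and so on) are made-up strings rather than Twister2 TSet method names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of visitor-style translation; none of these types are Beam or Twister2 classes. */
final class TranslationSketch {

  /** A pipeline node: a transform kind plus a user-visible name. */
  static final class Node {
    final String kind;
    final String name;

    Node(String kind, String name) {
      this.kind = kind;
      this.name = name;
    }
  }

  /** Converts one node into a label for the operation it becomes. */
  interface Translator {
    String translate(Node node);
  }

  private final Map<String, Translator> translators = new HashMap<>();

  TranslationSketch() {
    // One translator per primitive, mirroring the per-transform translators in the list above.
    translators.put("Read", n -> "source(" + n.name + ")");
    translators.put("ParDo", n -> "compute(" + n.name + ")");
    translators.put("GroupByKey", n -> "groupByKey(" + n.name + ")");
    translators.put("Flatten", n -> "union(" + n.name + ")");
    translators.put("Window", n -> "assignWindows(" + n.name + ")");
  }

  /** Visits nodes in order and fails fast on a transform without a translator. */
  List<String> translate(List<Node> pipeline) {
    List<String> plan = new ArrayList<>();
    for (Node node : pipeline) {
      Translator translator = translators.get(node.kind);
      if (translator == null) {
        throw new UnsupportedOperationException("No translator for " + node.kind);
      }
      plan.add(translator.translate(node));
    }
    return plan;
  }

  public static void main(String[] args) {
    List<Node> pipeline = Arrays.asList(
        new Node("Read", "lines"), new Node("ParDo", "parse"), new Node("GroupByKey", "byUser"));
    System.out.println(new TranslationSketch().translate(pipeline));
  }
}
```

Failing fast on an unregistered transform kind is a design choice of this sketch; it surfaces unsupported transforms at translation time rather than during execution.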
Step 4: Classpath Packaging
The runner packages all required classpath JARs into a zip archive for deployment to the Twister2 cluster. It scans the classpath entries from the ClassLoader, filters for JAR files and directories, and creates a compressed archive that will be distributed to all worker nodes.
Key considerations:
- All transitive dependencies must be included in the archive
- The zip is submitted alongside the job configuration
- System properties for cluster type and Twister2 home are configured
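The packaging step (split the classpath, keep JAR files and directories, write them to one zip) can be sketched with the JDK alone. This is an illustration, not the runner's code: `ClasspathZipSketch`, the `lib/` and `classes/` archive layout, and the index-based naming are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** JDK-only sketch of classpath packaging; layout and names are assumptions. */
final class ClasspathZipSketch {

  /** Splits a classpath string and keeps only existing JAR files and directories. */
  static List<Path> deployableEntries(String classpath) {
    List<Path> kept = new ArrayList<>();
    for (String entry : classpath.split(File.pathSeparator)) {
      if (entry.isEmpty()) {
        continue;
      }
      Path path = Paths.get(entry);
      boolean isJar = Files.isRegularFile(path) && entry.endsWith(".jar");
      if (isJar || Files.isDirectory(path)) {
        kept.add(path);
      }
    }
    return kept;
  }

  /** Writes each entry under an index-based name so archive paths cannot collide. */
  static void zip(List<Path> entries, Path target) throws IOException {
    try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(target))) {
      for (int i = 0; i < entries.size(); i++) {
        Path entry = entries.get(i);
        if (Files.isDirectory(entry)) {
          List<Path> files;
          try (Stream<Path> walk = Files.walk(entry)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
          }
          for (Path file : files) {
            // Zip entry names always use forward slashes.
            String relative = entry.relativize(file).toString().replace(File.separatorChar, '/');
            copy(file, "classes/" + i + "/" + relative, zos);
          }
        } else {
          copy(entry, "lib/" + i + "-" + entry.getFileName(), zos);
        }
      }
    }
  }

  private static void copy(Path file, String name, ZipOutputStream zos) throws IOException {
    zos.putNextEntry(new ZipEntry(name));
    Files.copy(file, zos);
    zos.closeEntry();
  }

  /** Builds a throwaway classpath in a temp directory, zips it, and lists the archive. */
  static List<String> demoEntryNames() {
    try {
      Path root = Files.createTempDirectory("classpath-sketch");
      Path jar = Files.write(root.resolve("dep.jar"), new byte[] {1, 2, 3});
      Path classes = Files.createDirectory(root.resolve("classes"));
      Files.write(classes.resolve("App.class"), new byte[] {4});
      Path ignored = Files.write(root.resolve("notes.txt"), new byte[] {5});
      String classpath = jar + File.pathSeparator + classes + File.pathSeparator + ignored;

      Path archive = root.resolve("job.zip");
      zip(deployableEntries(classpath), archive);
      try (ZipFile zipFile = new ZipFile(archive.toFile())) {
        return zipFile.stream().map(ZipEntry::getName).sorted().collect(Collectors.toList());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demoEntryNames());
  }
}
```

In a real job the input would be the running JVM's classpath (for example `System.getProperty("java.class.path")`); the demo uses a temporary directory so that it never archives unrelated files.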
Step 5: Job Submission
The runner submits the packaged job to either a local Twister2 instance or a remote cluster. For local execution, it uses LocalSubmitter which runs the BeamBatchWorker in-process. For cluster execution, it uses Twister2Submitter to deploy to the configured cluster manager. The BeamBatchWorker receives the serialized TSet DAG and executes it within the Twister2 runtime.
Key considerations:
- BeamBatchWorker is the entry point on each Twister2 worker node
- The worker deserializes the pipeline options and TSet environment
- LocalSubmitter is used for development and testing
- Twister2Submitter handles cluster-mode deployment
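The hand-off in this step (serialize the job, pick a local or cluster path, let the worker restore and run it) can be sketched with plain Java serialization. Everything here is a hypothetical model: `SubmissionSketch`, `Job`, `workerExecute`, and the boolean `local` switch are illustrative names, and the cluster branch only reports the hand-off instead of contacting a cluster manager.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of job submission; not the Twister2 submitter API. */
final class SubmissionSketch {

  /** Hypothetical job description: a target cluster type plus an ordered plan. */
  static final class Job implements Serializable {
    private static final long serialVersionUID = 1L;
    final String clusterType;
    final ArrayList<String> plan;

    Job(String clusterType, List<String> plan) {
      this.clusterType = clusterType;
      this.plan = new ArrayList<>(plan);
    }
  }

  /** Turns the job into bytes, as it would be shipped to a worker. */
  static byte[] serialize(Job job) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(job);
    } catch (IOException e) {
      throw new IllegalStateException("Job is not serializable", e);
    }
    return bytes.toByteArray();
  }

  /** Stand-in for the worker entry point: restores the job and "runs" each operation. */
  static List<String> workerExecute(byte[] payload) {
    Job job;
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
      job = (Job) in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Could not restore job", e);
    }
    List<String> log = new ArrayList<>();
    for (String operation : job.plan) {
      log.add("ran " + operation);
    }
    return log;
  }

  /** Runs the worker in-process when local; otherwise only reports the hand-off. */
  static List<String> submit(Job job, boolean local) {
    byte[] payload = serialize(job);
    if (local) {
      return workerExecute(payload);
    }
    return Collections.singletonList(
        "handed " + job.plan.size() + " operations to " + job.clusterType);
  }

  public static void main(String[] args) {
    Job job = new Job("standalone", Arrays.asList("source", "compute"));
    System.out.println(submit(job, true));
  }
}
```

Routing the local path through the same serialize and restore round trip as the cluster path is deliberate in this sketch: it lets a development run catch serialization problems before the job reaches a cluster.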
Step 6: Execution and Result Collection
The Twister2 runtime executes the TSet DAG across the allocated workers. Each TSet operation runs the corresponding Beam compute function (DoFnFunction for ParDo, GroupByWindowFunction for GroupByKey, and so on), with windowing preserved and side inputs served through Twister2SideInputReader. On completion, the Twister2Runner returns a Twister2PipelineResult that carries the final pipeline state.
Key considerations:
- DoFnFunction manages the full DoFn lifecycle (setup, start bundle, process element, finish bundle, teardown)
- Twister2SideInputReader provides access to side inputs within compute functions
- NoOpStepContext provides a minimal step context for the execution environment
- The result reports DONE, FAILED, or CANCELLED status
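The lifecycle ordering and the terminal state reported by the result can be sketched as a small driver loop. This is a simplified model, not Beam's DoFn machinery: `LifecycleSketch`, the `Fn` interface, and `TracingUpperCase` are hypothetical, and the `State` enum lists only the three terminal values named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of a function lifecycle and terminal state; not the Beam API. */
final class LifecycleSketch {

  /** The terminal states named in the text; CANCELLED is never produced by this sketch. */
  enum State { DONE, FAILED, CANCELLED }

  /** Hypothetical lifecycle hooks, called in the order they are declared. */
  interface Fn {
    void setup();
    void startBundle();
    void processElement(String element, List<String> out);
    void finishBundle();
    void teardown();
  }

  /** Drives one bundle through the lifecycle and maps the outcome to a state. */
  static State run(Fn fn, List<String> input, List<String> out) {
    try {
      fn.setup();
      fn.startBundle();
      for (String element : input) {
        fn.processElement(element, out);
      }
      fn.finishBundle();
      return State.DONE;
    } catch (RuntimeException e) {
      return State.FAILED;
    } finally {
      // In this sketch, teardown runs whether or not processing succeeded.
      fn.teardown();
    }
  }

  /** Upper-cases elements and records every lifecycle call; fails on an empty element. */
  static final class TracingUpperCase implements Fn {
    final List<String> calls = new ArrayList<>();

    public void setup() { calls.add("setup"); }
    public void startBundle() { calls.add("startBundle"); }
    public void processElement(String element, List<String> out) {
      calls.add("process:" + element);
      if (element.isEmpty()) {
        throw new IllegalArgumentException("empty element");
      }
      out.add(element.toUpperCase(Locale.ROOT));
    }
    public void finishBundle() { calls.add("finishBundle"); }
    public void teardown() { calls.add("teardown"); }
  }

  public static void main(String[] args) {
    TracingUpperCase fn = new TracingUpperCase();
    List<String> out = new ArrayList<>();
    State state = run(fn, Arrays.asList("a", "b"), out);
    System.out.println(state + " " + out + " " + fn.calls);
  }
}
```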