Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Workflow:Apache Beam Twister2 Batch Execution

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Batch_Processing, HPC
Last Updated 2026-02-09 04:30 GMT

Overview

End-to-end process for executing an Apache Beam batch pipeline on Twister2, an HPC-oriented big data framework, covering pipeline translation to TSet DAGs, classpath packaging, and job submission.

Description

This workflow describes how the Twister2 Runner translates a Beam pipeline into Twister2's TSet (typed set) execution model for batch processing. The Twister2Runner applies transform overrides, invokes the batch pipeline translator to convert Beam transforms into TSet operations, packages the classpath into a deployable archive, and submits the job either locally or to a Twister2 cluster. The runner supports all core Beam batch primitives including ParDo, GroupByKey, Flatten, Window, and Read through dedicated batch translator implementations.

Note: The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0.

Usage

Execute this workflow when you need to run a Beam batch pipeline on a Twister2 cluster, particularly for HPC workloads that benefit from Twister2's efficient inter-process communication and resource management. This runner is suitable for batch-only pipelines on infrastructure where Twister2 is already deployed.

Execution Steps

Step 1: Pipeline Configuration

Configure the pipeline with Twister2-specific options via Twister2PipelineOptions. This includes setting the worker count, cluster type (standalone, Mesos, Kubernetes, Nomad, or Slurm), Twister2 home directory, and resource allocations (CPU, RAM, disk per worker). The Twister2Runner is registered via AutoService and can be selected as the pipeline runner.

Key considerations:

  • Twister2PipelineOptions extends PipelineOptions with cluster-specific settings
  • Twister2RunnerRegistrar provides automatic discovery via ServiceLoader
  • Default configuration targets standalone local execution

Step 2: Transform Override Application

The Twister2Runner applies transform overrides to decompose high-level transforms into primitives the Twister2 translator can handle. SplittableDoFn transforms are decomposed into their constituent processing and restriction tracking components. The overrides ensure compatibility with the batch translation layer.

Key considerations:

  • SplittableParDo overrides are applied for splittable DoFn support
  • The same override mechanism as the Direct Runner is reused
  • Only batch-mode overrides are applied (streaming is not yet supported)

Step 3: Pipeline Translation

The Twister2PipelineExecutionEnvironment orchestrates the translation of the Beam pipeline into a Twister2 TSet DAG. A Twister2BatchPipelineTranslator visits each transform in the pipeline and delegates to a corresponding BatchTransformTranslator. Each translator converts the Beam transform into equivalent TSet operations, maintaining the data flow through Twister2's type-safe distributed dataset abstraction.

Key considerations:

  • ParDoMultiOutputTranslatorBatch handles ParDo with multiple outputs via DoFnFunction
  • GroupByKeyTranslatorBatch implements key-based grouping with window-aware reduce via GroupByWindowFunction
  • ReadSourceTranslatorBatch wraps Beam BoundedSources as Twister2BoundedSource TSets
  • FlattenTranslatorBatch unions multiple input TSets
  • Serialization between stages uses ElemToBytesFunction and ByteToElemFunction

Step 4: Classpath Packaging

The runner packages all required classpath JARs into a zip archive for deployment to the Twister2 cluster. It scans the classpath entries from the ClassLoader, filters for JAR files and directories, and creates a compressed archive that will be distributed to all worker nodes.

Key considerations:

  • All transitive dependencies must be included in the archive
  • The zip is submitted alongside the job configuration
  • System properties for cluster type and Twister2 home are configured

Step 5: Job Submission

The runner submits the packaged job to either a local Twister2 instance or a remote cluster. For local execution, it uses LocalSubmitter which runs the BeamBatchWorker in-process. For cluster execution, it uses Twister2Submitter to deploy to the configured cluster manager. The BeamBatchWorker receives the serialized TSet DAG and executes it within the Twister2 runtime.

Key considerations:

  • BeamBatchWorker is the entry point on each Twister2 worker node
  • The worker deserializes the pipeline options and TSet environment
  • LocalSubmitter is used for development and testing
  • Twister2Submitter handles cluster-mode deployment

Step 6: Execution and Result Collection

The Twister2 runtime executes the TSet DAG across the allocated workers. Each TSet operation runs the corresponding Beam compute function (DoFnFunction for ParDo, GroupByWindowFunction for GBK, etc.) with proper windowing and side input support via Twister2SideInputReader. Upon completion, the Twister2Runner returns a Twister2PipelineResult with the final state.

Key considerations:

  • DoFnFunction manages the full DoFn lifecycle (setup, process, finish, teardown)
  • Twister2SideInputReader provides access to side inputs within compute functions
  • NoOpStepContext provides a minimal step context for the execution environment
  • The result reports DONE, FAILED, or CANCELLED status

Execution Diagram

GitHub URL

Workflow Repository