Principle:Apache Beam Pipeline Translation Twister2
| Attribute | Value |
|---|---|
| Principle Name | Pipeline Translation Twister2 |
| Domain | Pipeline_Orchestration, Compilation, HPC |
| Description | Process of translating a Beam pipeline's transform graph into a Twister2 TSet directed acyclic graph for HPC batch execution |
| Deprecation Notice | The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0 |
| last_updated | 2026-02-09 04:00 GMT |
Overview
Pipeline Translation Twister2 describes the process of converting a Beam pipeline's transform graph into a Twister2 TSet directed acyclic graph (DAG) for execution on HPC batch systems. This translation is the bridge between Beam's portable, runner-agnostic pipeline representation and Twister2's native execution model.
Note: The Twister2 Runner is deprecated and is scheduled for removal in Apache Beam 3.0. Users should plan migration to an actively maintained runner.
Description
Pipeline translation converts Beam transforms into Twister2 TSet operations using a two-phase approach:
Phase 1: Mode Detection
The Twister2PipelineExecutionEnvironment first traverses the pipeline using a TranslationModeDetector visitor to determine whether the pipeline is bounded (batch) or unbounded (streaming). If any PCollection is found to be UNBOUNDED, the runner switches to streaming mode. However, streaming is not currently supported by the Twister2 Runner, and an UnsupportedOperationException is thrown in that case.
Phase 2: Transform-by-Transform Translation
For batch pipelines, a Twister2BatchPipelineTranslator is created and uses the visitor pattern to traverse the pipeline topologically. Each primitive transform is dispatched to a registered BatchTransformTranslator that creates the corresponding TSet operation. The translation builds up a Twister2BatchTranslationContext containing the TSet DAG.
The following transform translators are registered:
| Transform URN | Translator Class | TSet Operation |
|---|---|---|
IMPULSE_TRANSFORM_URN |
ImpulseTranslatorBatch |
Source TSet |
READ_TRANSFORM_URN |
ReadSourceTranslatorBatch |
Source TSet from BoundedSource |
PAR_DO_TRANSFORM_URN |
ParDoMultiOutputTranslatorBatch |
ComputeCollector TSet via DoFnFunction |
GROUP_BY_KEY_TRANSFORM_URN |
GroupByKeyTranslatorBatch |
Keyed TSet operations with GroupByWindowFunction |
FLATTEN_TRANSFORM_URN |
FlattenTranslatorBatch |
Union TSet |
CREATE_VIEW_TRANSFORM_URN |
PCollectionViewTranslatorBatch |
Cached side input TSet |
ASSIGN_WINDOWS_TRANSFORM_URN |
AssignWindowTranslatorBatch |
Window assignment TSet |
The translation process works as follows:
- The translator's
translate(pipeline)method triggers a topological traversal of the pipeline - For each primitive transform node,
visitPrimitiveTransform(node)is called - The transform's URN is looked up in the
TRANSFORM_TRANSLATORSmap - If a translator is found,
translator.translateNode(transform, translationContext)is called - If no translator is found, an
IllegalStateExceptionis thrown
The Twister2BatchTranslationContext accumulates:
- TSet nodes representing individual operations
- Side input datasets as
BatchTSetinstances keyed by PCollectionView name - Leaf nodes representing pipeline outputs (sinks)
- The TSet graph connecting all operations
Usage
Pipeline translation happens automatically during Twister2Runner.run() when env.translate(pipeline) is called. Users do not directly interact with the translator. Understanding translation is useful when:
- Debugging unsupported transforms -- If a transform has no registered translator, translation fails with
IllegalStateException: no translator registered for ... - Understanding performance characteristics -- The mapping from Beam transforms to TSet operations determines shuffle boundaries and parallelism behavior
- Extending the runner -- Adding support for new transforms requires registering a new
BatchTransformTranslator
Theoretical Basis
Pipeline translation is based on several well-established compiler and software engineering principles:
- Compiler IR Lowering -- Translation acts as a lowering pass that converts a high-level intermediate representation (Beam transforms) to a low-level execution plan (TSet DAG). This is analogous to how a compiler lowers a high-level AST to machine-specific instructions.
- Visitor Pattern -- The
Twister2BatchPipelineTranslatorextendsTwister2PipelineTranslator(which implementsPipeline.PipelineVisitor). The visitor pattern enables extensible, per-transform translation without modifying the pipeline data structure. Each transform type has its own translator registered in a dispatch map.
- Registry Pattern -- The static
TRANSFORM_TRANSLATORSmap acts as a registry, mapping transform URNs to their corresponding translator implementations. This enables O(1) dispatch and easy extension.
- Topological Sort -- The pipeline is traversed topologically, ensuring that all inputs to a transform have been translated before the transform itself. This guarantees that TSet inputs are available when a TSet operation is created.
- Context Object Pattern -- The
Twister2BatchTranslationContextserves as a mutable context object passed through the translation process, accumulating the resulting TSet DAG incrementally.
Related Pages
- Implementation:Apache_Beam_Twister2BatchPipelineTranslator -- Concrete translator implementation for batch pipelines
- Principle:Apache_Beam_Transform_Override_Application_Twister2 -- Override step that precedes translation
- Principle:Apache_Beam_Job_Submission_Twister2 -- Job submission step that follows translation
- Principle:Apache_Beam_Twister2_Execution_and_Result_Collection -- Execution of the translated TSet DAG