Principle:Apache Beam Classpath Packaging
| Attribute | Value |
|---|---|
| Principle Name | Classpath Packaging |
| Domain | Packaging, HPC |
| Description | Process of bundling all Java classpath dependencies and Twister2 libraries into ZIP archives for distribution to cluster nodes |
| Deprecation Notice | The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0 |
| last_updated | 2026-02-09 04:00 GMT |
Overview
Classpath Packaging describes the process of bundling all Java classpath dependencies into ZIP archives for distribution to Twister2 cluster worker nodes. Before a Twister2 job can be submitted to the cluster, all JARs and libraries that the pipeline depends on must be packaged into a portable archive that can be shipped to remote nodes. This is a critical step because worker nodes do not share the client's classpath.
Note: The Twister2 Runner is deprecated and is scheduled for removal in Apache Beam 3.0. Users should plan migration to an actively maintained runner.
Description
Before submitting a Twister2 job to the cluster, all Java dependencies must be packaged into ZIP archives that can be distributed to worker nodes. The packaging process occurs in two stages within the Twister2Runner:
Stage 1: Prepare Files to Stage
The prepareFilesToStage() method delegates to Beam's PipelineResources.prepareFilesForStaging(options), which scans the current JVM classpath and determines which files need to be staged for remote execution. This uses the FileStagingOptions interface (which Twister2PipelineOptions extends) to determine the list of files.
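As a rough illustration of what "scanning the current JVM classpath" means, the sketch below splits the `java.class.path` system property and keeps the entries that exist on disk. It uses only the Java standard library and is not Beam's implementation: `PipelineResources.prepareFilesForStaging(options)` may handle cases this sketch ignores, and the class and method names here are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Beam or Twister2. */
final class ClasspathScanSketch {

  /** Splits a classpath string and returns the entries that exist on disk. */
  static List<String> existingEntries(String classpath) {
    List<String> entries = new ArrayList<>();
    if (classpath == null || classpath.isEmpty()) {
      return entries;
    }
    // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
    for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
      if (entry.isEmpty()) {
        continue;
      }
      File file = new File(entry);
      if (file.exists()) {
        entries.add(file.getAbsolutePath());
      }
    }
    return entries;
  }

  /** Candidate files to stage for the JVM this code runs in. */
  static List<String> currentJvmEntries() {
    return existingEntries(System.getProperty("java.class.path"));
  }
}
```

Calling `ClasspathScanSketch.currentJvmEntries()` from the client program gives a quick view of which JARs and directories are candidates for staging.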
Stage 2: ZIP File Creation
The zipFilesToStage() method creates a single ZIP archive from the staged files:
- Retrieve files to stage -- Gets the list of JAR files from `options.getFilesToStage()`
- Filter Twister2 JARs -- Removes any JARs from the `/org/twister2` path, since these are already provided by the Twister2 installation on the cluster
- Deduplicate -- Uses a `HashSet` to track file names and skip duplicates
- Create ZIP -- Creates a temporary ZIP file with a `lib/` directory prefix, writing each JAR into the archive
- Set job file path -- Stores the ZIP file path in `options.setJobFileZip()`
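The filter, deduplicate, and write steps can be sketched with the standard `java.util.zip` classes. This is a minimal sketch, not the Twister2Runner source: the class name, the temporary file prefix, the substring match on `/org/twister2`, and the choice to silently skip entries that are not regular files are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration; not the Twister2Runner implementation. */
final class ZipStagingSketch {

  /** Zips the given files under lib/, skipping Twister2 JARs and duplicate file names. */
  static File zipFiles(List<String> filesToStage) throws IOException {
    File zip = File.createTempFile("twister2-job-", ".zip"); // prefix is an assumption
    zip.deleteOnExit();
    Set<String> seenNames = new HashSet<>();
    try (OutputStream out = Files.newOutputStream(zip.toPath());
        ZipOutputStream zos = new ZipOutputStream(out)) {
      for (String path : filesToStage) {
        // Assumed filter: normalise separators, then match on the path fragment.
        if (path.replace('\\', '/').contains("/org/twister2")) {
          continue; // provided by the cluster's Twister2 installation
        }
        File file = new File(path);
        // Simplification: skip anything that is not a regular file,
        // and any file whose name has already been written.
        if (!file.isFile() || !seenNames.add(file.getName())) {
          continue;
        }
        zos.putNextEntry(new ZipEntry("lib/" + file.getName()));
        Files.copy(file.toPath(), zos);
        zos.closeEntry();
      }
    }
    return zip;
  }
}
```

Because deduplication is by file name only, two different JARs that share a name in different directories collapse to whichever appears first in the list.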
System Setup
After packaging, the setupSystem() method sets Java system properties that the Twister2 job submission framework reads:
| System Property | Value | Description |
|---|---|---|
| `cluster_type` | From options | Cluster type (standalone, nomad, kubernetes, mesos) |
| `job_file` | ZIP path | Path to the packaged ZIP archive |
| `job_type` | From options | Job packaging format (default: java_zip) |
| `twister2_home` | From options (or temp dir) | Twister2 installation directory |
| `config_dir` | twister2Home + /conf/ | Twister2 configuration directory |
In cluster mode, the system also validates that required configuration files exist: core.yaml, network.yaml, data.yaml, resource.yaml, and task.yaml.
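The existence check can be pictured as a loop over the five required file names. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and the directory to check is passed in by the caller because this page does not state exactly which directory under `config_dir` holds the files.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not the Twister2Runner implementation. */
final class ConfigCheckSketch {

  /** The configuration files named on this page as required in cluster mode. */
  static final List<String> REQUIRED =
      Arrays.asList("core.yaml", "network.yaml", "data.yaml", "resource.yaml", "task.yaml");

  /** Returns the required file names that are absent from the given directory. */
  static List<String> missingFiles(File configDir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!new File(configDir, name).isFile()) {
        missing.add(name);
      }
    }
    return missing;
  }
}
```

Running a check like this against the cluster's configuration directory before submitting a job reports every missing file at once.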
Usage
Classpath packaging happens automatically during Twister2Runner.run() as part of the setupSystem() call. Users do not directly invoke the packaging mechanism. However, understanding it is important when:
- Troubleshooting "class not found" errors on workers -- Missing classes indicate that the required JAR was not included in the ZIP archive
- Debugging packaging failures -- `FileNotFoundException` during ZIP creation suggests classpath configuration issues
- Optimizing job submission -- Large ZIP files can slow down job distribution; understanding which JARs are included helps with optimization
- Dealing with Twister2 library conflicts -- The filter that removes `/org/twister2` JARs prevents version conflicts between the pipeline's bundled Twister2 and the cluster installation
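When investigating a missing class or an oversized archive, it can help to list what the ZIP actually contains. The following sketch reads the entry names and uncompressed sizes with `java.util.zip.ZipFile`. The class and method names are hypothetical; the path to inspect would be whatever the `job_file` system property holds after packaging.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical diagnostic helper; not part of Beam or Twister2. */
final class ZipInspectSketch {

  /** Maps each entry name in the archive to its uncompressed size in bytes. */
  static Map<String, Long> entrySizes(File zip) throws IOException {
    Map<String, Long> sizes = new LinkedHashMap<>();
    try (ZipFile zipFile = new ZipFile(zip)) {
      Enumeration<? extends ZipEntry> entries = zipFile.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        sizes.put(entry.getName(), entry.getSize());
      }
    }
    return sizes;
  }
}
```

A JAR that is expected but absent from the `lib/` entries points to a classpath problem on the client, while unusually large entries are candidates for trimming.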
Local vs Cluster Mode
In local mode (when twister2Home is null or empty):
- The `twister2_home` system property is set to `java.io.tmpdir`
- The `config_dir` is set to `java.io.tmpdir/conf/`
- No configuration file validation is performed
In cluster mode:
- The `twister2_home` and `config_dir` are set from the options
- Configuration files are validated for existence
- Logging configuration is loaded from `logger.properties` if present
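The difference between the two modes comes down to where the home directory is taken from. A minimal sketch of that branch, assuming the local-mode test is exactly "null or empty" as described above, is shown here. The class and method names are hypothetical, and the sketch returns the values in a map instead of calling `System.setProperty` so that it has no side effects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the Twister2Runner implementation. */
final class ModeSketch {

  /** Local mode applies when no Twister2 home has been configured. */
  static boolean isLocalMode(String twister2Home) {
    return twister2Home == null || twister2Home.isEmpty();
  }

  /** Computes the twister2_home and config_dir values for the given option value. */
  static Map<String, String> resolve(String twister2Home) {
    String home =
        isLocalMode(twister2Home) ? System.getProperty("java.io.tmpdir") : twister2Home;
    Map<String, String> props = new LinkedHashMap<>();
    props.put("twister2_home", home);
    props.put("config_dir", home + "/conf/");
    return props;
  }
}
```

For example, `ModeSketch.resolve("/opt/twister2")` yields a `config_dir` of `/opt/twister2/conf/`, while `ModeSketch.resolve(null)` falls back to the JVM's temporary directory.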
Theoretical Basis
Classpath packaging is based on the closure property of distributed execution: all code needed by workers must be explicitly bundled and shipped since remote nodes do not share the client's classpath. This is a fundamental constraint of distributed computing systems.
Key theoretical principles:
- Self-Contained Deployment Unit -- The ZIP archive acts as a self-contained deployment unit (analogous to a Docker image or a fat JAR), ensuring reproducible execution on any cluster node.
- Dependency Isolation -- By filtering out Twister2 JARs from the archive (since they are provided by the cluster), the system avoids version conflicts. This follows the principle of preferring the platform's libraries over application-bundled ones (similar to "provided" scope in Maven).
- Deduplication -- Duplicate JAR files are eliminated by name to minimize archive size, following the general principle of minimizing data transfer in distributed systems.
- Idempotent Packaging -- The packaging process is idempotent: given the same classpath and options, it produces functionally equivalent ZIP archives. The ZIP file is created as a temporary file with `deleteOnExit()` to ensure cleanup.
Related Pages
- Implementation:Apache_Beam_Twister2Runner_ZipDependencies -- Concrete implementation of the ZIP packaging logic
- Principle:Apache_Beam_Job_Submission_Twister2 -- Job submission that uses the packaged ZIP
- Principle:Apache_Beam_Pipeline_Configuration_Twister2 -- Configuration that determines packaging parameters