Principle:Apache Beam Classpath Packaging

From Leeroopedia


Attribute: Value
Principle Name: Classpath Packaging
Domain: Packaging, HPC
Description: Process of bundling all Java classpath dependencies and Twister2 libraries into ZIP archives for distribution to cluster nodes
Deprecation Notice: The Twister2 Runner is deprecated and scheduled for removal in Apache Beam 3.0
last_updated: 2026-02-09 04:00 GMT

Overview

Classpath Packaging is the process of bundling a pipeline's Java classpath dependencies into a ZIP archive for distribution to Twister2 cluster worker nodes. Worker nodes do not share the client's classpath, so every JAR and library the pipeline depends on must be packaged into a portable archive and shipped to the remote nodes before the job can be submitted.

Note: The Twister2 Runner is deprecated and is scheduled for removal in Apache Beam 3.0. Users should plan migration to an actively maintained runner.

Description

The Twister2Runner performs the packaging in two stages:

Stage 1: Prepare Files to Stage

The prepareFilesToStage() method delegates to Beam's PipelineResources.prepareFilesForStaging(options), which scans the current JVM classpath and determines which files need to be staged for remote execution. This uses the FileStagingOptions interface (which Twister2PipelineOptions extends) to determine the list of files.
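
To make the idea of "scanning the classpath" concrete, the following is a rough sketch that splits a classpath string and keeps the entries that exist on disk. It is not Beam's implementation: PipelineResources.prepareFilesForStaging(options) may detect and prepare resources differently, and the class and method names here (ClasspathScanSketch, existingEntries) are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the logic used by Beam's PipelineResources. */
final class ClasspathScanSketch {

  /** Splits a classpath string and returns the absolute paths of entries that exist. */
  static List<String> existingEntries(String classpath) {
    List<String> result = new ArrayList<>();
    for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
      File file = new File(entry);
      if (!entry.isEmpty() && file.exists()) {
        result.add(file.getAbsolutePath());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints the entries of the current JVM classpath that exist on disk.
    for (String entry : existingEntries(System.getProperty("java.class.path"))) {
      System.out.println(entry);
    }
  }
}
```

Running main from the same launcher that starts the pipeline gives a first approximation of what is available to be staged.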

Stage 2: ZIP File Creation

The zipFilesToStage() method creates a single ZIP archive from the staged files:

  1. Retrieve files to stage -- Gets the list of JAR files from options.getFilesToStage()
  2. Filter Twister2 JARs -- Removes any JARs from the /org/twister2 path, since these are already provided by the Twister2 installation on the cluster
  3. Deduplicate -- Uses a HashSet to track file names and skip duplicates
  4. Create ZIP -- Creates a temporary ZIP file with a lib/ directory prefix, writing each JAR into the archive
  5. Set job file path -- Stores the ZIP file path in options.setJobFileZip()
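
The five steps above can be sketched with the standard java.util.zip API. This is a minimal illustration under stated assumptions, not the Twister2Runner source: the class name StagingZipSketch is hypothetical, the file list is passed in directly instead of being read from the options, every entry is assumed to be a regular file, and step 5 is represented by returning the file so the caller can store its path.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Illustrative sketch only; not the Twister2Runner source. */
final class StagingZipSketch {

  /** Builds a temporary ZIP in which every kept file is stored under lib/. */
  static File zipFilesToStage(List<String> filesToStage) throws IOException {
    File zipFile = File.createTempFile("twister2-", ".zip");
    zipFile.deleteOnExit(); // request deletion when the JVM exits normally
    Set<String> seenNames = new HashSet<>();

    try (OutputStream out = Files.newOutputStream(zipFile.toPath());
        ZipOutputStream zipOut = new ZipOutputStream(out)) {
      zipOut.putNextEntry(new ZipEntry("lib/")); // explicit directory entry
      zipOut.closeEntry();
      for (String path : filesToStage) {
        if (path.contains("/org/twister2")) {
          continue; // step 2: provided by the cluster's Twister2 installation
        }
        File source = new File(path);
        if (!seenNames.add(source.getName())) {
          continue; // step 3: a file with this name was already written
        }
        zipOut.putNextEntry(new ZipEntry("lib/" + source.getName())); // step 4
        Files.copy(source.toPath(), zipOut); // assumes a regular file
        zipOut.closeEntry();
      }
    }
    return zipFile; // step 5: the caller stores zipFile.getPath()
  }
}
```

Because deduplication is by file name only, two different JARs that share a name in different directories collapse to whichever appears first in the list.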

System Setup

After packaging, the setupSystem() method sets Java system properties that the Twister2 job submission framework reads:

System Property | Value | Description
cluster_type | From options | Cluster type (standalone, nomad, kubernetes, mesos)
job_file | ZIP path | Path to the packaged ZIP archive
job_type | From options | Job packaging format (default: java_zip)
twister2_home | From options (or temp dir) | Twister2 installation directory
config_dir | twister2Home + /conf/ | Twister2 configuration directory

In cluster mode, the system also validates that required configuration files exist: core.yaml, network.yaml, data.yaml, resource.yaml, and task.yaml.
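
A check of this kind can be sketched as follows. The five file names come from the text above; everything else is an assumption for illustration. In particular, this page does not say which directory the runner inspects or which exception it raises, so the hypothetical helper missingFiles takes the directory as a parameter and simply reports what is absent.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the directory to check is supplied by the caller. */
final class ConfigDirCheckSketch {

  static final String[] REQUIRED = {
    "core.yaml", "network.yaml", "data.yaml", "resource.yaml", "task.yaml"
  };

  /** Returns the required file names that do not exist in the given directory. */
  static List<String> missingFiles(File configDir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!new File(configDir, name).exists()) {
        missing.add(name);
      }
    }
    return missing;
  }
}
```

Running such a check by hand before submitting a job can surface an incomplete Twister2 configuration earlier than a failed submission would.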

Usage

Classpath packaging runs automatically inside Twister2Runner.run(), as part of the setupSystem() call, so users never invoke it directly. Understanding it still matters when:

  • Troubleshooting "class not found" errors on workers (for example ClassNotFoundException or NoClassDefFoundError) -- Missing classes indicate that the required JAR was not included in the ZIP archive
  • Debugging packaging failures -- FileNotFoundException during ZIP creation suggests classpath configuration issues
  • Optimizing job submission -- Large ZIP files can slow down job distribution; understanding which JARs are included helps with optimization
  • Dealing with Twister2 library conflicts -- The filter that removes /org/twister2 JARs prevents version conflicts between the pipeline's bundled Twister2 and the cluster installation
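
For the troubleshooting and optimization cases above, it can help to list what actually ended up in the archive. The sketch below uses only java.util.zip; the class name ZipInspectSketch is hypothetical, and the archive path is assumed to be supplied by the reader (for example, the value of the job_file system property). Since the archive is a temporary file, it presumably has to be inspected while the submitting JVM is still running, or copied elsewhere first.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Illustrative sketch only; lists the entry names of a ZIP archive. */
final class ZipInspectSketch {

  /** Returns the entry names of the archive in sorted order. */
  static List<String> listEntries(File zip) throws IOException {
    List<String> names = new ArrayList<>();
    try (ZipFile zipFile = new ZipFile(zip)) {
      Enumeration<? extends ZipEntry> entries = zipFile.entries();
      while (entries.hasMoreElements()) {
        names.add(entries.nextElement().getName());
      }
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical usage: pass the archive path as the first argument.
    for (String name : listEntries(new File(args[0]))) {
      System.out.println(name);
    }
  }
}
```

A JAR that is expected under lib/ but absent from the listing is a likely cause of a missing class on the workers.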

Local vs Cluster Mode

In local mode (when twister2Home is null or empty):

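  • The twister2_home system property is set to the value of the java.io.tmpdir system property (the JVM's temporary directory)
  • The config_dir system property is set to that temporary directory followed by /conf/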
  • No configuration file validation is performed

In cluster mode:

  • The twister2_home property is taken from the twister2Home option, and config_dir is that path followed by /conf/
  • Configuration files are validated for existence
  • Logging configuration is loaded from logger.properties if present
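
The difference between the two modes comes down to how the two directory properties are resolved. The following sketch captures just that rule as a pure function, assuming the behavior described above. The names SystemSetupSketch and resolve are hypothetical, and the sketch returns a map instead of calling System.setProperty so that it has no side effects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; mirrors the local vs cluster rule described above. */
final class SystemSetupSketch {

  /** Local mode applies when no Twister2 home is configured. */
  static boolean isLocalMode(String twister2Home) {
    return twister2Home == null || twister2Home.isEmpty();
  }

  /** Resolves twister2_home and config_dir from the option and the temp directory. */
  static Map<String, String> resolve(String twister2Home, String tmpDir) {
    String home = isLocalMode(twister2Home) ? tmpDir : twister2Home;
    Map<String, String> props = new LinkedHashMap<>();
    props.put("twister2_home", home);
    props.put("config_dir", home + "/conf/");
    return props;
  }

  public static void main(String[] args) {
    // Local mode: both properties derive from java.io.tmpdir.
    System.out.println(resolve(null, System.getProperty("java.io.tmpdir")));
  }
}
```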

Theoretical Basis

Classpath packaging is based on the closure property of distributed execution: all code needed by workers must be explicitly bundled and shipped since remote nodes do not share the client's classpath. This is a fundamental constraint of distributed computing systems.

Key theoretical principles:

  • Self-Contained Deployment Unit -- The ZIP archive acts as a self-contained deployment unit (analogous to a Docker image or a fat JAR), ensuring reproducible execution on any cluster node.
  • Dependency Isolation -- By filtering out Twister2 JARs from the archive (since they are provided by the cluster), the system avoids version conflicts. This follows the principle of preferring the platform's libraries over application-bundled ones (similar to "provided" scope in Maven).
  • Deduplication -- Duplicate JAR files are eliminated by name to minimize archive size, following the general principle of minimizing data transfer in distributed systems.
  • Idempotent Packaging -- The packaging process is idempotent in effect: given the same classpath and options, it produces functionally equivalent ZIP archives, although each run writes a new temporary file. The ZIP file is marked with deleteOnExit(), which requests its deletion when the JVM terminates normally.
