Principle:Heibaiying BigData Notes Storm Application Packaging
Overview
| Property | Value |
|---|---|
| Concept | Storm Application Packaging |
| Category | Stream Processing / Build and Deployment |
| Applies To | Apache Storm Topologies built with Maven |
| Prerequisites | Understanding of Storm topology deployment, Maven build system basics |
Description
Before a Storm topology can be submitted to a production cluster, it must be packaged as a JAR file containing the application code and all of its runtime dependencies. This is commonly referred to as a fat JAR or uber JAR. The packaged JAR is then submitted to the cluster using the storm jar command, which uploads it to the Nimbus daemon for distribution to worker nodes.
Proper packaging is a critical step in the Storm deployment workflow. Incorrect packaging can lead to runtime errors such as:
- ClassNotFoundException -- A required dependency was not included in the JAR.
- Found multiple defaults.yaml resources -- The storm-core JAR was included in the fat JAR, conflicting with the one provided by the cluster environment.
- No FileSystem for scheme: hdfs -- Service provider configuration files were overwritten during packaging (a known issue with maven-assembly-plugin).
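The last two errors in the list above can often be spotted before submission by looking inside the built JAR. The following is a minimal sketch, not part of Storm or Maven: the class name `JarDiagnostics` and its helper are hypothetical. It assumes that a root-level `defaults.yaml` entry indicates a bundled Storm JAR, and that Hadoop's file system implementations are registered in `META-INF/services/org.apache.hadoop.fs.FileSystem`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

// Hypothetical helper for inspecting a packaged topology JAR.
final class JarDiagnostics {
    // Service file consulted through ServiceLoader for FileSystem implementations.
    static final String HADOOP_FS_SERVICE =
            "META-INF/services/org.apache.hadoop.fs.FileSystem";

    // Parses a provider-configuration file: one class name per line,
    // '#' starts a comment, blank lines are ignored.
    static List<String> parseProviders(String content) {
        List<String> providers = new ArrayList<>();
        for (String line : content.split("\\R")) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                providers.add(name);
            }
        }
        return providers;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JarDiagnostics <path-to-jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            // Assumption: a root-level defaults.yaml means Storm's own JAR was bundled.
            System.out.println("defaults.yaml bundled: "
                    + (jar.getEntry("defaults.yaml") != null));

            JarEntry entry = jar.getJarEntry(HADOOP_FS_SERVICE);
            if (entry == null) {
                System.out.println("no FileSystem service file in this JAR");
                return;
            }
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                String content = reader.lines().collect(Collectors.joining("\n"));
                System.out.println("FileSystem providers: " + parseProviders(content));
            }
        }
    }
}
```

If the provider list printed for a topology that writes to HDFS does not include `org.apache.hadoop.hdfs.DistributedFileSystem`, the service file was probably overwritten during packaging.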
Usage
There are three primary Maven-based packaging approaches for Storm applications, each with different trade-offs:
Approach 1: Plain mvn package
This is the simplest approach, but the resulting JAR contains only the project's own classes and none of its dependencies. It is suitable only for projects with no third-party libraries.
mvn package
If a project packaged this way does have external dependencies, they must be supplied at submission time with the --jars option:
storm jar topology.jar com.example.MainClass \
--jars "./lib/dependency1.jar,./lib/dependency2.jar"
Approach 2: maven-assembly-plugin
Creates a fat JAR with all dependencies bundled. This is the approach recommended in Storm's official documentation for simple projects.
mvn assembly:assembly
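This produces a JAR whose file name ends in -jar-with-dependencies. Note that the assembly:assembly goal belongs to older 2.x releases of the plugin; more recent releases provide assembly:single instead, usually bound to the package phase.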
Approach 3: maven-shade-plugin (Recommended)
Creates a fat JAR and merges duplicate resource files through configurable transformers instead of overwriting them. This is the recommended approach for production use, particularly when integrating with Hadoop ecosystem components (HDFS, HBase, etc.).
mvn package
This produces the shaded JAR alongside the unshaded version, which is kept under a file name prefixed with original-.
Theoretical Basis
Why Fat JARs Are Necessary
Storm's cluster architecture requires that all application code and dependencies be available on every worker node where the topology's tasks execute. Since worker nodes may not have access to the developer's local Maven repository or the internet, the simplest and most reliable approach is to bundle everything into a single self-contained JAR file.
The storm-core Exclusion Rule
A critical packaging rule is that storm-core must be excluded from the fat JAR. The Storm cluster already provides storm-core in its classpath (located in the lib/ directory of the Storm installation). Including it in the application JAR causes a conflict because both JARs contain defaults.yaml, leading to the "Found multiple defaults.yaml resources" RuntimeException.
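The conflict can be illustrated with a plain classpath lookup. The sketch below is not Storm's own code; the class name `DuplicateResourceCheck` is hypothetical, and it only assumes that the check amounts to asking the class loader for every resource named `defaults.yaml` and expecting exactly one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of a "how many copies are on the classpath" check.
final class DuplicateResourceCheck {
    // Returns every classpath location that provides a resource with this name.
    static List<URL> findResources(ClassLoader loader, String name) {
        try {
            return Collections.list(loader.getResources(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> found = findResources(loader, "defaults.yaml");

        System.out.println("defaults.yaml copies on classpath: " + found.size());
        for (URL url : found) {
            System.out.println("  " + url);
        }
        if (found.size() > 1) {
            // Two copies usually mean the cluster's JAR and a bundled copy both matched.
            System.out.println("More than one copy found: storm-core is probably bundled in the topology JAR.");
        }
    }
}
```

Running this with both the cluster's Storm libraries and a fat JAR that bundles storm-core on the classpath would be expected to list two locations, which is the situation the exclusion rule prevents.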
There are two ways to exclude storm-core:
- Set its Maven scope to provided -- This works, but it can break local runs (for example, launching the topology's main class from an IDE), because provided dependencies are on the compile and test classpaths but not on the runtime classpath.
- Exclude it in the packaging plugin configuration -- This is the recommended approach because it allows storm-core to remain available during local development while being excluded from the final JAR.
Assembly vs. Shade: Key Differences
| Feature | maven-assembly-plugin | maven-shade-plugin |
|---|---|---|
| Resource handling | Overwrites duplicate resource files | Merges duplicate resource files using configurable transformers |
| Service provider files | Overwrites META-INF/services files | Merges via ServicesResourceTransformer |
| Manifest handling | Basic manifest configuration | Advanced manifest transformation via ManifestResourceTransformer |
| Signature files | May include conflicting signatures | Can exclude META-INF signature files (*.SF, *.DSA, *.RSA) |
| HDFS compatibility | May cause "No FileSystem for scheme" errors | Handles service provider merging correctly |
| Recommendation | Suitable for simple topologies | Recommended for all production use |
The maven-shade-plugin's ServicesResourceTransformer is particularly important because Java's ServiceLoader mechanism relies on META-INF/services files to discover implementations. When multiple JARs provide service implementations (common with Hadoop ecosystem libraries), the assembly plugin overwrites these files, causing implementations to be lost. The shade plugin merges them, preserving all service registrations.
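The difference between overwriting and merging can be shown with a small conceptual model. This is a sketch of the idea only, not the code of either plugin: the class `ServiceFileMergeDemo` is hypothetical, each inner list stands for the lines of one copy of the same `META-INF/services` file, and "overwrite" is modelled as the last copy winning (which copy actually survives depends on packaging order). The two provider class names are real Hadoop classes, but their assignment to separate JARs here is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of two strategies for duplicate service files.
final class ServiceFileMergeDemo {
    // Overwrite strategy: only one copy survives (modelled here as the last one).
    static List<String> overwrite(List<List<String>> copies) {
        if (copies.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(copies.get(copies.size() - 1));
    }

    // Merge strategy: union of all lines, in first-seen order, without duplicates.
    static List<String> merge(List<List<String>> copies) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> copy : copies) {
            merged.addAll(copy);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Assumed contents of the same service file as shipped by two different JARs.
        List<String> fromHdfsJar =
                Arrays.asList("org.apache.hadoop.hdfs.DistributedFileSystem");
        List<String> fromCommonJar =
                Arrays.asList("org.apache.hadoop.fs.LocalFileSystem");
        List<List<String>> copies = Arrays.asList(fromHdfsJar, fromCommonJar);

        System.out.println("overwrite: " + overwrite(copies));
        System.out.println("merge:     " + merge(copies));
    }
}
```

In this model the overwrite result no longer lists `DistributedFileSystem`, so a `ServiceLoader` lookup would not find an implementation for the hdfs scheme, while the merged result keeps both registrations.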
Deployment Command
After packaging, the topology is deployed using:
storm jar /path/to/topology-fat.jar com.example.TopologyMainClass [args...]
The storm jar command:
- Adds the specified JAR to the classpath.
- Invokes the specified main class.
- The main class calls StormSubmitter.submitTopology().
- The JAR is uploaded to Nimbus for distribution.
Related Pages
| Relationship | Page |
|---|---|
| implemented_by | Heibaiying_BigData_Notes_Maven_Packaging_for_Storm |
| related | Heibaiying_BigData_Notes_Storm_Topology_Deployment |
| related | Heibaiying_BigData_Notes_Storm_Parallelism_Configuration |