Workflow:Heibaiying BigData Notes Storm Topology Development

Knowledge Sources	BigData-Notes Apache Storm Docs
Domains	Big_Data, Stream_Processing, Storm
Last Updated	2026-02-10 10:00 GMT

Overview

End-to-end process for building and deploying an Apache Storm topology using the Spout-Bolt programming model for real-time stream processing.

Description

This workflow covers the full development lifecycle of an Apache Storm topology. It starts with implementing a Spout to generate or ingest streaming data, then building Bolt components for processing stages (splitting, counting, persisting), wiring them together using a TopologyBuilder with appropriate stream groupings, and finally deploying the topology in either local or cluster mode. The workflow demonstrates the foundational pattern used across all Storm integrations including HBase, HDFS, Kafka, and Redis sinks.

Usage

Execute this workflow when you need to build a real-time stream processing application using Apache Storm. This applies when you have a continuous data source (simulated, Kafka, or other) and need to process, transform, and aggregate data with low latency, optionally persisting results to external storage systems.

Execution Steps

Step 1: Implement the Data Source Spout

Create a Spout class extending BaseRichSpout that serves as the data entry point for the topology. Implement the open() method to initialize the SpoutOutputCollector, the nextTuple() method to emit data tuples, and declareOutputFields() to define the output schema.

Key considerations:

nextTuple() is called in a tight loop; add sleep when no data is available
Use SpoutOutputCollector.emit() to send tuples to downstream bolts
Declare output field names that downstream bolts will reference
For reliable processing, emit tuples with a message ID to enable replay on failure

Step 2: Implement Processing Bolts

Create one or more Bolt classes extending BaseRichBolt for each processing stage. Implement prepare() for initialization, execute() for tuple processing logic, and declareOutputFields() for output schema. Each bolt receives tuples from upstream, processes them, and optionally emits results downstream.

What happens:

SplitBolt: receives lines of text, tokenizes into individual words, emits each word
CountBolt: maintains in-memory word counts, accumulates frequencies
Custom bolts: implement domain-specific processing logic
Each bolt must declare its output fields for downstream consumption

Step 3: Wire the Topology

Use TopologyBuilder to assemble Spouts and Bolts into a directed acyclic graph. Define data flow connections and choose stream grouping strategies that control how tuples are distributed between parallel bolt instances.

Grouping strategies:

Shuffle grouping: random distribution across bolt instances
Fields grouping: route tuples by specific field values (e.g., same word to same bolt)
All grouping: broadcast tuples to all bolt instances
Global grouping: send all tuples to a single bolt instance

Step 4: Configure Parallelism

Set the number of parallel instances (executors) for each Spout and Bolt component. Configure the number of worker processes for the topology. Tune parallelism based on data volume and processing requirements.

Key considerations:

Set parallelism hint on each component via setSpout() and setBolt()
Number of workers controls JVM process count across the cluster
Balance parallelism across components to avoid bottlenecks
Spouts typically need fewer instances than compute-heavy bolts

Step 5: Deploy in Local or Cluster Mode

For development and testing, submit the topology to a LocalCluster that runs entirely within a single JVM. For production, use StormSubmitter to deploy the topology to a remote Storm cluster managed by Nimbus.

What happens:

Local mode: instantiate LocalCluster, submit topology, run for a duration, then shutdown
Cluster mode: package as JAR, use StormSubmitter.submitTopology() to submit to Nimbus
Storm distributes the topology across available Supervisor nodes
The topology runs continuously until explicitly killed

Step 6: Package for Deployment

Package the Storm application as a JAR file using Maven. Choose between three packaging strategies: maven-shade-plugin for uber-JARs with dependency relocation, maven-assembly-plugin for simple fat JARs, or placing dependencies in the Storm lib directory.

Key considerations:

Exclude Storm core dependencies (provided by the cluster runtime)
Use shade plugin for production deployments to avoid classpath conflicts
Assembly plugin is simpler but may cause conflicts with signed JARs
Include all third-party dependencies required by bolts (e.g., HBase, Redis clients)

Execution Diagram

GitHub URL

Workflow Repository