Workflow:Heibaiying BigData Notes Storm Topology Development
| Knowledge Sources | |
|---|---|
| Domains | Big_Data, Stream_Processing, Storm |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for building and deploying an Apache Storm topology using the Spout-Bolt programming model for real-time stream processing.
Description
This workflow covers the full development lifecycle of an Apache Storm topology. It starts with implementing a Spout to generate or ingest streaming data, then building Bolt components for processing stages (splitting, counting, persisting), wiring them together using a TopologyBuilder with appropriate stream groupings, and finally deploying the topology in either local or cluster mode. The workflow demonstrates the foundational pattern used across all Storm integrations including HBase, HDFS, Kafka, and Redis sinks.
Usage
Execute this workflow when you need to build a real-time stream processing application using Apache Storm. This applies when you have a continuous data source (simulated, Kafka, or other) and need to process, transform, and aggregate data with low latency, optionally persisting results to external storage systems.
Execution Steps
Step 1: Implement the Data Source Spout
Create a Spout class extending BaseRichSpout that serves as the data entry point for the topology. Implement the open() method to initialize the SpoutOutputCollector, the nextTuple() method to emit data tuples, and declareOutputFields() to define the output schema.
Key considerations:
- nextTuple() is called in a tight loop; add sleep when no data is available
- Use SpoutOutputCollector.emit() to send tuples to downstream bolts
- Declare output field names that downstream bolts will reference
- For reliable processing, emit tuples with a message ID to enable replay on failure
Step 2: Implement Processing Bolts
Create one or more Bolt classes extending BaseRichBolt for each processing stage. Implement prepare() for initialization, execute() for tuple processing logic, and declareOutputFields() for output schema. Each bolt receives tuples from upstream, processes them, and optionally emits results downstream.
What happens:
- SplitBolt: receives lines of text, tokenizes into individual words, emits each word
- CountBolt: maintains in-memory word counts, accumulates frequencies
- Custom bolts: implement domain-specific processing logic
- Each bolt must declare its output fields for downstream consumption
Step 3: Wire the Topology
Use TopologyBuilder to assemble Spouts and Bolts into a directed acyclic graph. Define data flow connections and choose stream grouping strategies that control how tuples are distributed between parallel bolt instances.
Grouping strategies:
- Shuffle grouping: random distribution across bolt instances
- Fields grouping: route tuples by specific field values (e.g., same word to same bolt)
- All grouping: broadcast tuples to all bolt instances
- Global grouping: send all tuples to a single bolt instance
Step 4: Configure Parallelism
Set the number of parallel instances (executors) for each Spout and Bolt component. Configure the number of worker processes for the topology. Tune parallelism based on data volume and processing requirements.
Key considerations:
- Set parallelism hint on each component via setSpout() and setBolt()
- Number of workers controls JVM process count across the cluster
- Balance parallelism across components to avoid bottlenecks
- Spouts typically need fewer instances than compute-heavy bolts
Step 5: Deploy in Local or Cluster Mode
For development and testing, submit the topology to a LocalCluster that runs entirely within a single JVM. For production, use StormSubmitter to deploy the topology to a remote Storm cluster managed by Nimbus.
What happens:
- Local mode: instantiate LocalCluster, submit topology, run for a duration, then shutdown
- Cluster mode: package as JAR, use StormSubmitter.submitTopology() to submit to Nimbus
- Storm distributes the topology across available Supervisor nodes
- The topology runs continuously until explicitly killed
Step 6: Package for Deployment
Package the Storm application as a JAR file using Maven. Choose between three packaging strategies: maven-shade-plugin for uber-JARs with dependency relocation, maven-assembly-plugin for simple fat JARs, or placing dependencies in the Storm lib directory.
Key considerations:
- Exclude Storm core dependencies (provided by the cluster runtime)
- Use shade plugin for production deployments to avoid classpath conflicts
- Assembly plugin is simpler but may cause conflicts with signed JARs
- Include all third-party dependencies required by bolts (e.g., HBase, Redis clients)