Principle:Heibaiying BigData Notes Storm Topology Wiring
Overview
| Property | Value |
|---|---|
| Concept | Storm Topology Wiring |
| Category | Stream Processing / Topology Construction |
| Applies To | Apache Storm Topologies |
| Prerequisites | Understanding of Spout and Bolt implementations, and the concept of stream groupings |
Description
A Storm Topology is a directed acyclic graph (DAG) that defines how data flows from sources (Spouts) through processing stages (Bolts). Wiring a topology is the act of using the TopologyBuilder API to register Spouts and Bolts, and to specify how tuples are routed between them via stream groupings.
The TopologyBuilder class is the primary API for constructing Storm topologies. It provides two key methods:
- setSpout(id, spout) -- Registers a Spout with a unique string identifier.
- setBolt(id, bolt) -- Registers a Bolt with a unique string identifier and returns a declaration object that allows the developer to specify input sources and stream groupings.
After all components are registered, builder.createTopology() produces a StormTopology object that can be submitted to either a local cluster or a remote production cluster.
Usage
Topology wiring is the central configuration step in any Storm application. The developer must decide:
- Which Spouts will produce data and what their component IDs will be.
- Which Bolts will process data, how they connect to upstream components, and what stream groupings to use.
- The overall DAG structure -- whether the topology is a simple linear chain, a fan-out/fan-in pattern, or a complex graph with multiple branches.
A typical wiring pattern for the word-count topology:
TopologyBuilder builder = new TopologyBuilder();
// Register the data source
builder.setSpout("DataSourceSpout", new DataSourceSpout());
// Register processing bolts with stream groupings
builder.setBolt("SplitBolt", new SplitBolt())
.shuffleGrouping("DataSourceSpout");
builder.setBolt("CountBolt", new CountBolt())
.shuffleGrouping("SplitBolt");
// Build the topology
StormTopology topology = builder.createTopology();
Theoretical Basis
Topology wiring embodies the composition principle in dataflow programming. A complex computation is decomposed into simple, reusable operators (Spouts and Bolts) that are composed together via explicit data routing rules.
Stream Groupings
Stream groupings are the mechanism by which Storm determines how tuples are distributed from a source component to a target component. The choice of grouping has significant implications for processing semantics and load balancing:
| Grouping | Description | Use Case |
|---|---|---|
| Shuffle Grouping | Distributes tuples randomly and evenly across all tasks of the target Bolt | General-purpose load balancing; use when tuple ordering and affinity do not matter |
| Fields Grouping | Routes tuples to target tasks based on a hash of specified fields; tuples with the same field values always go to the same task | Aggregation by key (e.g., word counting); ensures all tuples for a given word reach the same task |
| All Grouping | Broadcasts every tuple to all tasks of the target Bolt | Distributing configuration updates or reference data to all processing instances |
| Global Grouping | Sends all tuples to the single task with the lowest ID | Producing a single global aggregate or ordering |
| Direct Grouping | The producer specifies exactly which task receives the tuple | Fine-grained routing control for advanced use cases |
| Local or Shuffle Grouping | Prefers routing to tasks within the same worker process; falls back to shuffle if no local task exists | Reducing network overhead by favoring intra-process communication |
DAG Structure
The topology DAG imposes a partial order on processing. Tuples flow from Spouts through a series of Bolts, with each Bolt potentially feeding multiple downstream Bolts (fan-out) or receiving from multiple upstream components (fan-in). The acyclic constraint ensures there are no circular dependencies, which simplifies reasoning about data flow and enables Storm's tuple tracking mechanism for reliable processing.
Component IDs
Each Spout and Bolt is assigned a unique string identifier during wiring. These IDs serve multiple purposes:
- They define the edges in the topology DAG (a Bolt references its input source by ID).
- They appear in Storm's monitoring UI for operational visibility.
- They are used in log messages for debugging.
Related Pages
| Relationship | Page |
|---|---|
| implemented_by | Heibaiying_BigData_Notes_TopologyBuilder_Usage |
| related | Heibaiying_BigData_Notes_Storm_Spout_Implementation |
| related | Heibaiying_BigData_Notes_Storm_Bolt_Implementation |
| related | Heibaiying_BigData_Notes_Storm_Parallelism_Configuration |