Principle:Heibaiying BigData Notes Storm Topology Wiring

Overview

Property	Value
Concept	Storm Topology Wiring
Category	Stream Processing / Topology Construction
Applies To	Apache Storm Topologies
Prerequisites	Understanding of Spout and Bolt implementations, and the concept of stream groupings

Description

A Storm Topology is a directed acyclic graph (DAG) that defines how data flows from sources (Spouts) through processing stages (Bolts). Wiring a topology is the act of using the TopologyBuilder API to register Spouts and Bolts, and to specify how tuples are routed between them via stream groupings.

The TopologyBuilder class is the primary API for constructing Storm topologies. It provides two key methods:

setSpout(id, spout) -- Registers a Spout with a unique string identifier.
setBolt(id, bolt) -- Registers a Bolt with a unique string identifier and returns a declaration object that allows the developer to specify input sources and stream groupings.

After all components are registered, builder.createTopology() produces a StormTopology object that can be submitted to either a local cluster or a remote production cluster.

Usage

Topology wiring is the central configuration step in any Storm application. The developer must decide:

Which Spouts will produce data and what their component IDs will be.
Which Bolts will process data, how they connect to upstream components, and what stream groupings to use.
The overall DAG structure -- whether the topology is a simple linear chain, a fan-out/fan-in pattern, or a complex graph with multiple branches.

A typical wiring pattern for the word-count topology:

TopologyBuilder builder = new TopologyBuilder();

// Register the data source
builder.setSpout("DataSourceSpout", new DataSourceSpout());

// Register processing bolts with stream groupings
builder.setBolt("SplitBolt", new SplitBolt())
       .shuffleGrouping("DataSourceSpout");
builder.setBolt("CountBolt", new CountBolt())
       .shuffleGrouping("SplitBolt");

// Build the topology
StormTopology topology = builder.createTopology();

Theoretical Basis

Topology wiring embodies the composition principle in dataflow programming. A complex computation is decomposed into simple, reusable operators (Spouts and Bolts) that are composed together via explicit data routing rules.

Stream Groupings

Stream groupings are the mechanism by which Storm determines how tuples are distributed from a source component to a target component. The choice of grouping has significant implications for processing semantics and load balancing:

Grouping	Description	Use Case
Shuffle Grouping	Distributes tuples randomly and evenly across all tasks of the target Bolt	General-purpose load balancing; use when tuple ordering and affinity do not matter
Fields Grouping	Routes tuples to target tasks based on a hash of specified fields; tuples with the same field values always go to the same task	Aggregation by key (e.g., word counting); ensures all tuples for a given word reach the same task
All Grouping	Broadcasts every tuple to all tasks of the target Bolt	Distributing configuration updates or reference data to all processing instances
Global Grouping	Sends all tuples to the single task with the lowest ID	Producing a single global aggregate or ordering
Direct Grouping	The producer specifies exactly which task receives the tuple	Fine-grained routing control for advanced use cases
Local or Shuffle Grouping	Prefers routing to tasks within the same worker process; falls back to shuffle if no local task exists	Reducing network overhead by favoring intra-process communication

DAG Structure

The topology DAG imposes a partial order on processing. Tuples flow from Spouts through a series of Bolts, with each Bolt potentially feeding multiple downstream Bolts (fan-out) or receiving from multiple upstream components (fan-in). The acyclic constraint ensures there are no circular dependencies, which simplifies reasoning about data flow and enables Storm's tuple tracking mechanism for reliable processing.

Component IDs

Each Spout and Bolt is assigned a unique string identifier during wiring. These IDs serve multiple purposes:

They define the edges in the topology DAG (a Bolt references its input source by ID).
They appear in Storm's monitoring UI for operational visibility.
They are used in log messages for debugging.

Related Pages

Relationship	Page
implemented_by	Heibaiying_BigData_Notes_TopologyBuilder_Usage
related	Heibaiying_BigData_Notes_Storm_Spout_Implementation
related	Heibaiying_BigData_Notes_Storm_Bolt_Implementation
related	Heibaiying_BigData_Notes_Storm_Parallelism_Configuration

Implementation:Heibaiying_BigData_Notes_TopologyBuilder_Usage

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment