Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Heibaiying BigData Notes Storm Topology Wiring

From Leeroopedia


Overview

Property Value
Concept Storm Topology Wiring
Category Stream Processing / Topology Construction
Applies To Apache Storm Topologies
Prerequisites Understanding of Spout and Bolt implementations, and the concept of stream groupings

Description

A Storm Topology is a directed acyclic graph (DAG) that defines how data flows from sources (Spouts) through processing stages (Bolts). Wiring a topology is the act of using the TopologyBuilder API to register Spouts and Bolts, and to specify how tuples are routed between them via stream groupings.

The TopologyBuilder class is the primary API for constructing Storm topologies. It provides two key methods:

  • setSpout(id, spout) -- Registers a Spout with a unique string identifier.
  • setBolt(id, bolt) -- Registers a Bolt with a unique string identifier and returns a declaration object that allows the developer to specify input sources and stream groupings.

After all components are registered, builder.createTopology() produces a StormTopology object that can be submitted to either a local cluster or a remote production cluster.

Usage

Topology wiring is the central configuration step in any Storm application. The developer must decide:

  • Which Spouts will produce data and what their component IDs will be.
  • Which Bolts will process data, how they connect to upstream components, and what stream groupings to use.
  • The overall DAG structure -- whether the topology is a simple linear chain, a fan-out/fan-in pattern, or a complex graph with multiple branches.

A typical wiring pattern for the word-count topology:

TopologyBuilder builder = new TopologyBuilder();

// Register the data source
builder.setSpout("DataSourceSpout", new DataSourceSpout());

// Register processing bolts with stream groupings
builder.setBolt("SplitBolt", new SplitBolt())
       .shuffleGrouping("DataSourceSpout");
builder.setBolt("CountBolt", new CountBolt())
       .shuffleGrouping("SplitBolt");

// Build the topology
StormTopology topology = builder.createTopology();

Theoretical Basis

Topology wiring embodies the composition principle in dataflow programming. A complex computation is decomposed into simple, reusable operators (Spouts and Bolts) that are composed together via explicit data routing rules.

Stream Groupings

Stream groupings are the mechanism by which Storm determines how tuples are distributed from a source component to a target component. The choice of grouping has significant implications for processing semantics and load balancing:

Grouping Description Use Case
Shuffle Grouping Distributes tuples randomly and evenly across all tasks of the target Bolt General-purpose load balancing; use when tuple ordering and affinity do not matter
Fields Grouping Routes tuples to target tasks based on a hash of specified fields; tuples with the same field values always go to the same task Aggregation by key (e.g., word counting); ensures all tuples for a given word reach the same task
All Grouping Broadcasts every tuple to all tasks of the target Bolt Distributing configuration updates or reference data to all processing instances
Global Grouping Sends all tuples to the single task with the lowest ID Producing a single global aggregate or ordering
Direct Grouping The producer specifies exactly which task receives the tuple Fine-grained routing control for advanced use cases
Local or Shuffle Grouping Prefers routing to tasks within the same worker process; falls back to shuffle if no local task exists Reducing network overhead by favoring intra-process communication

DAG Structure

The topology DAG imposes a partial order on processing. Tuples flow from Spouts through a series of Bolts, with each Bolt potentially feeding multiple downstream Bolts (fan-out) or receiving from multiple upstream components (fan-in). The acyclic constraint ensures there are no circular dependencies, which simplifies reasoning about data flow and enables Storm's tuple tracking mechanism for reliable processing.

Component IDs

Each Spout and Bolt is assigned a unique string identifier during wiring. These IDs serve multiple purposes:

  • They define the edges in the topology DAG (a Bolt references its input source by ID).
  • They appear in Storm's monitoring UI for operational visibility.
  • They are used in log messages for debugging.

Related Pages

Relationship Page
implemented_by Heibaiying_BigData_Notes_TopologyBuilder_Usage
related Heibaiying_BigData_Notes_Storm_Spout_Implementation
related Heibaiying_BigData_Notes_Storm_Bolt_Implementation
related Heibaiying_BigData_Notes_Storm_Parallelism_Configuration

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment