
Principle:Heibaiying BigData Notes Storm Bolt Implementation

From Leeroopedia


Overview

Concept: Storm Bolt Implementation
Category: Stream Processing / Data Transformation
Applies To: Apache Storm Topologies
Prerequisites: Understanding of Storm topology architecture, Spouts, and the tuple data model

Description

A Bolt is the processing unit in an Apache Storm topology. Bolts receive tuples from upstream Spouts or other Bolts, perform computations on the data, and optionally emit new tuples downstream to the next stage of the processing pipeline.

While Spouts are responsible for data ingestion, Bolts are where the actual business logic resides. A topology can contain any number of Bolts arranged in a DAG (directed acyclic graph), enabling complex multi-stage data transformations such as filtering, aggregation, joining, database writes, and machine learning inference.

In Storm's programming model, a Bolt is implemented by extending the BaseRichBolt abstract class (or implementing the IRichBolt interface directly). The developer must provide implementations for three core lifecycle methods:

  • prepare() -- Called once when the Bolt is initialized. This is where the OutputCollector reference is captured and any resources (database connections, caches) are initialized.
  • execute() -- Called once for each incoming tuple. This is where the processing logic lives. The Bolt reads fields from the input tuple, performs transformations, and optionally emits new tuples.
  • declareOutputFields() -- Declares the schema (field names) of the tuples that this Bolt emits. If the Bolt is a terminal node (sink) that does not emit tuples, this method body is left empty.

Usage

Bolts serve a wide variety of roles within a Storm topology:

  • Transformation Bolts -- Parse, split, or reformat incoming data (e.g., splitting a sentence into individual words).
  • Aggregation Bolts -- Accumulate counts, sums, averages, or other statistics over a stream of tuples.
  • Filter Bolts -- Selectively pass through or drop tuples based on conditions.
  • Join Bolts -- Combine tuples from multiple streams based on a common key.
  • Sink Bolts -- Write results to external systems such as databases, file systems, or dashboards. These are terminal Bolts that do not emit downstream tuples.
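The transformation case, splitting a sentence into individual words, boils down to logic like the following. This is a plain-Java sketch of the work a split Bolt's execute() method would perform; the class and method names are illustrative, not part of the Storm API:

```java
import java.util.Arrays;
import java.util.List;

public class SplitLogic {
    // Core of a transformation Bolt: turn one input sentence
    // into one output word per downstream tuple.
    public static List<String> splitSentence(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }
}
```

Inside a real Bolt, execute() would call this on the sentence field of the input tuple and emit one Values(word) per element of the result.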

A typical Bolt implementation follows this pattern:

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MyBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Initialize resources (database connections, caches) here
    }

    @Override
    public void execute(Tuple input) {
        // Read a field from the input tuple
        String value = input.getStringByField("fieldName");
        // Apply the business logic
        String result = transform(value);
        // Emit downstream, anchored to the input tuple (skip if this is a terminal bolt)
        collector.emit(input, new Values(result));
        // Acknowledge the input so Storm's tracker marks it fully processed
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("resultField"));
    }

    private String transform(String value) {
        return value.toUpperCase(); // placeholder for the real business logic
    }
}

Theoretical Basis

Bolts embody the operator concept in the dataflow programming model. Each Bolt is a stateless or stateful operator that transforms an input stream into an output stream. The composition of multiple Bolts forms a processing pipeline where data flows through successive transformation stages.

Key theoretical concepts related to Bolt processing:

Tuple-at-a-time processing: Storm follows a tuple-at-a-time processing model, where execute() is invoked once per incoming tuple. This provides low-latency processing at the cost of higher overhead compared to micro-batch systems. Each tuple is processed independently, making the model simple to reason about.

Stream groupings: The way tuples are routed from one Bolt to the next is controlled by stream groupings. These determine the partitioning strategy:

  • Shuffle grouping -- Random distribution across downstream tasks for load balancing.
  • Fields grouping -- Hash-based partitioning by specified fields, ensuring all tuples with the same field value go to the same task.
  • All grouping -- Broadcast to all downstream tasks.
  • Global grouping -- Route all tuples to a single task (lowest task ID).
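Fields grouping's guarantee, that tuples with the same field value always reach the same task, follows from deterministic hash partitioning. The sketch below illustrates the idea in plain Java; Storm's actual implementation hashes the list of grouping-field values, and chooseTask here is an illustrative stand-in, not Storm API:

```java
public class FieldsGroupingSketch {
    // Route a tuple to one of numTasks downstream tasks by hashing
    // the grouping field's value. Deterministic: the same value
    // always maps to the same task index in [0, numTasks).
    public static int chooseTask(String fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }
}
```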

State management: Bolts can maintain in-memory state (e.g., a HashMap for counting). However, this state is not automatically durable. For fault-tolerant stateful processing, Storm provides the Trident API or integration with external state stores.
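For example, a counting Bolt's in-memory state is typically just a map on the worker's heap, which vanishes if the worker crashes. The class below is an illustrative sketch of that state, not Storm API:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountState {
    // In-memory Bolt state: word -> running count. Lives only in this
    // worker's heap, so it is lost on failure unless persisted to an
    // external store or managed through Trident.
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per incoming word tuple; returns the updated count.
    public long increment(String word) {
        return counts.merge(word, 1L, Long::sum);
    }
}
```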

Anchoring and acknowledgment: For reliable processing, Bolts should anchor emitted tuples to their input tuples using collector.emit(inputTuple, new Values(...)) and call collector.ack(inputTuple) after processing. This enables Storm's tuple tracking mechanism to guarantee at-least-once processing. When using BaseRichBolt, anchoring and acking are the developer's responsibility; BaseBasicBolt handles this automatically.

Related Pages

Relationship Page
implemented_by Heibaiying_BigData_Notes_SplitBolt_and_CountBolt
related Heibaiying_BigData_Notes_Storm_Spout_Implementation
related Heibaiying_BigData_Notes_Storm_Topology_Wiring
