Principle:Spotify Luigi MapReduce Processing
Template:Knowledge Source
Domains: Pipeline_Orchestration, Big_Data
Last Updated: 2026-02-10 00:00 GMT
Overview
MapReduce Processing is the paradigm of decomposing a large-scale data transformation into a map phase that processes individual records in parallel and a reduce phase that aggregates records sharing the same key.
Description
The MapReduce programming model, introduced by Google in 2004, provides a simple yet powerful abstraction for distributed data processing. A developer expresses computation as two functions:
- Mapper -- Receives one input record at a time and emits zero or more (key, value) pairs. The framework distributes input splits across worker nodes so that many mappers run in parallel.
- Reducer -- Receives a key and an iterator over all values associated with that key (after the framework sorts and groups mapper output by key). The reducer emits the final aggregated result for each key.
An optional third function, the combiner, acts as a local pre-reducer that runs on the mapper's output before it is shuffled across the network. Combiners reduce data transfer by partially aggregating values on the map side.
The framework handles all distributed systems concerns -- partitioning input, scheduling tasks across nodes, sorting and shuffling intermediate data, handling node failures, and writing output -- so the developer focuses solely on the map and reduce logic.
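The division of labor above can be sketched in plain Python. The snippet below is a minimal word-count simulation, not the API of any particular framework: `mapper` and `reducer` follow the signatures described above, and the hypothetical `run_job` driver stands in for the framework's partition/shuffle/sort machinery on a single machine.

```python
from collections import defaultdict

def mapper(record):
    """Receive one input record; emit zero or more (key, value) pairs."""
    for word in record.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Receive a key and all values grouped under it; emit the aggregate."""
    yield (key, sum(values))

def run_job(records):
    """Illustrative single-process driver standing in for the framework:
    map every record, group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):                  # shuffle/sort, then reduce
        for out_key, out_value in reducer(key, iter(groups[key])):
            results[out_key] = out_value
    return results

print(run_job(["the quick brown fox", "the lazy dog"]))
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In a real cluster the `groups` dictionary does not exist in one process; its role is played by the shuffle, which routes each key's values to a single reducer node.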
Beyond the core functions, a MapReduce job may also define:
- Reader -- Controls how raw input lines are parsed into records before being passed to the mapper.
- Writer -- Controls how reducer output records are formatted into the final output representation (e.g., tab-separated values).
- Final mapper / Final reducer -- Hook functions invoked after all regular map or reduce records have been processed, useful for emitting summary statistics or flushing accumulators.
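In Luigi-style jobs these hooks are overridable methods on a job class. The skeleton below imitates that shape in plain Python as a hedged sketch; the class name, the `run` driver, and the record layout are all illustrative, not taken from any framework's actual API.

```python
from collections import defaultdict

class LineCountJob:
    """Illustrative job skeleton; method names mirror the hook roles
    described above (reader, mapper, final mapper, reducer, writer)."""

    def reader(self, input_lines):
        # Parse raw input lines into records before mapping.
        for line in input_lines:
            yield line.rstrip("\n")

    def mapper(self, record):
        yield ("lines", 1)

    def final_mapper(self):
        # Invoked once after the last record; emit a summary marker.
        yield ("mapper_done", 1)

    def reducer(self, key, values):
        yield (key, sum(values))

    def writer(self, outputs):
        # Format reducer output as tab-separated values.
        for key, value in outputs:
            yield f"{key}\t{value}"

def run(job, input_lines):
    """Hypothetical single-process driver wiring the hooks together."""
    groups = defaultdict(list)
    for record in job.reader(input_lines):
        for k, v in job.mapper(record):
            groups[k].append(v)
    for k, v in job.final_mapper():
        groups[k].append(v)
    outputs = []
    for k in sorted(groups):
        outputs.extend(job.reducer(k, groups[k]))
    return list(job.writer(outputs))
```

For example, `run(LineCountJob(), ["a\n", "b\n", "c\n"])` yields the TSV lines `lines\t3` and `mapper_done\t1`.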
Usage
Use MapReduce Processing when:
- Processing large datasets (gigabytes to petabytes) that are stored on a distributed file system.
- The computation can be naturally expressed as per-record transformations followed by key-based aggregation.
- You need automatic parallelism and fault tolerance without writing explicit distributed systems code.
- Building ETL pipelines that count, filter, join, or summarize data (e.g., word count, log aggregation, inverted index construction).
- You want to leverage commodity hardware clusters for batch processing.
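One of the join workloads mentioned above can be expressed as a reduce-side join: mappers tag records from each dataset with their origin, and the reducer pairs them up per key. The sketch below is a minimal single-process illustration; the dataset shapes and function names are invented for the example.

```python
from collections import defaultdict

def map_user(record):
    # Hypothetical users dataset: (user_id, name)
    user_id, name = record
    yield (user_id, ("user", name))

def map_order(record):
    # Hypothetical orders dataset: (user_id, amount)
    user_id, amount = record
    yield (user_id, ("order", amount))

def join_reducer(key, values):
    """All tagged records for one user_id arrive together; pair them."""
    names = [v for tag, v in values if tag == "user"]
    amounts = [v for tag, v in values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield (key, name, amount)

def run_join(users, orders):
    """Illustrative driver: map both datasets, group by key, join."""
    groups = defaultdict(list)
    for rec in users:
        for k, v in map_user(rec):
            groups[k].append(v)
    for rec in orders:
        for k, v in map_order(rec):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(join_reducer(k, groups[k]))
    return out
```

Because the shuffle already colocates all values for a key on one reducer, the join itself needs no extra coordination.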
Theoretical Basis
The MapReduce model draws from functional programming's map and fold (reduce) operations:
- Map phase -- Formally, given an input set I and a function f: record -> list[(key, value)], the map phase computes M = union(f(r) for r in I). Each application of f is independent, enabling embarrassingly parallel execution.
- Shuffle and sort -- The framework partitions M by key (typically via hash partitioning) and sorts within each partition. This ensures that all values for a given key are colocated on a single reducer node, yielding groups G_k = [(k, v1), (k, v2), ...] for each unique key k.
- Reduce phase -- Given a function g: (key, iterator[value]) -> list[result], the reduce phase computes R = union(g(k, values(G_k)) for k in keys(M)). Each key group is processed independently, again enabling parallelism across keys.
- Combiner optimization -- When the reduce function is associative and commutative (e.g., summation), a combiner can apply g locally on each mapper's output partition before the shuffle, reducing network I/O from O(|M|) to O(|keys| * |mappers|) in the best case.
- Fault tolerance -- If a map or reduce task fails, the framework re-executes it on another node using the same input split. The deterministic, side-effect-free nature of map and reduce functions makes re-execution safe.
- Data locality -- The scheduler assigns map tasks to nodes that store the corresponding HDFS blocks, minimizing network reads and exploiting disk bandwidth.
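The combiner optimization and hash partitioning described above can be demonstrated concretely. In the sketch below (all names illustrative), one mapper's output is pre-aggregated locally for an associative, commutative reduce (summation), shrinking the number of pairs that would cross the network from the size of the mapper output to the number of distinct keys on that mapper.

```python
from collections import defaultdict

def combine_partition(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle.
    Valid here because summation is associative and commutative."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition_for(key, num_reducers):
    """Hash partitioning: deterministically route each key to one reducer,
    so all values for that key end up colocated."""
    return hash(key) % num_reducers

# One mapper emitted many pairs over only two distinct keys:
mapper_output = [("a", 1)] * 1000 + [("b", 1)] * 500
combined = combine_partition(mapper_output)

print(len(mapper_output))  # 1500 pairs would cross the network uncombined
print(len(combined))       # 2 pairs after local combining
```

Note that a combiner is an optimization only: the reducer still receives and sums the partial aggregates, so the final result is identical with or without it.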