Principle: Heibaiying BigData Notes - MapReduce Partitioner
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
The Partitioner determines which reducer receives each intermediate key-value pair, controlling the distribution of data across reduce tasks.
Description
After the Map phase (and optional Combiner phase), the framework must decide which reducer should process each intermediate key. The Partitioner is the component responsible for this routing decision. It takes a key, a value, and the total number of reducers, and returns an integer partition index in the range [0, numPartitions - 1].
The default partitioner in Hadoop is HashPartitioner, which computes the partition as (key.hashCode() & Integer.MAX_VALUE) % numReducers. This provides a roughly uniform distribution of keys across reducers, assuming the hash function distributes keys evenly.
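The default computation can be sketched in plain Java with no Hadoop dependency; the static method below mirrors the formula used by Hadoop's HashPartitioner (the class and method names here are illustrative, not Hadoop's):

```java
// Plain-Java sketch of the default HashPartitioner computation.
// Mirrors (key.hashCode() & Integer.MAX_VALUE) % numReducers.
public class HashPartitionDemo {
    static int getPartition(String key, int numPartitions) {
        // Mask with Integer.MAX_VALUE (0x7FFFFFFF) to clear the sign bit
        // before taking the modulus, so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String word : new String[] {"hadoop", "spark", "flink"}) {
            System.out.println(word + " -> reducer " + getPartition(word, numReducers));
        }
    }
}
```

Because `String.hashCode()` is specified by the Java language, the same key always maps to the same reducer for a fixed reducer count, but the assignment changes if the number of reducers changes.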
However, in some use cases, the default hash-based partitioning is insufficient. You may want to ensure that specific keys go to specific reducers -- for example, to produce output files with a known, deterministic assignment of keys. A custom Partitioner overrides the getPartition() method to implement application-specific routing logic.
In the word count example, a custom Partitioner assigns each word to a specific reducer based on the word's index in a predefined vocabulary list. This guarantees that each output file contains exactly one word's count, making the output deterministic and easy to verify.
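The routing logic described above can be sketched as follows. The vocabulary here is a hypothetical example; in a real job this logic would live in the `getPartition()` method of a class extending `org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable>`, registered on the job with `job.setPartitionerClass(...)` and paired with `job.setNumReduceTasks(vocabularySize)`:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of vocabulary-indexed routing: each in-vocabulary word
// goes to the partition matching its index in the list.
public class VocabPartitionDemo {
    // Hypothetical fixed vocabulary; the number of reduce tasks must equal its size.
    static final List<String> VOCAB = Arrays.asList("hadoop", "spark", "hive", "hbase");

    static int getPartition(String word, int numPartitions) {
        int idx = VOCAB.indexOf(word);
        // Fall back to hash partitioning for out-of-vocabulary words so every
        // key still receives a valid partition index.
        return idx >= 0 ? idx : (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The out-of-vocabulary fallback is one possible design choice; an alternative is to reserve one extra partition for all unknown words so the per-word output files stay pure.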
Usage
Use a custom Partitioner when:
- You need deterministic assignment of keys to reducers (e.g., for testing or verification).
- The default hash partitioning produces skewed data distribution (some reducers get much more data than others).
- You want to co-locate related keys on the same reducer for secondary sort or join operations.
- Business logic requires specific keys to appear in specific output files.
Use the default HashPartitioner when:
- Keys are uniformly distributed and no special routing is needed.
- You have no requirement for deterministic key-to-reducer assignment.
Theoretical Basis
The Partitioner function is defined as:
partition: (K2, V2, numPartitions) -> int
Where the returned integer p satisfies 0 <= p < numPartitions.
Default HashPartitioner:
p = (key.hashCode() & 0x7FFFFFFF) % numPartitions
The bitwise AND with 0x7FFFFFFF (Integer.MAX_VALUE) ensures the hash code is non-negative before taking the modulus.
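A short demonstration of why the mask is needed (the hash value here is a hypothetical negative hashCode, not from a real key): Java's `%` operator keeps the sign of the dividend, so a negative hash would produce a negative, out-of-range partition index.

```java
public class MaskDemo {
    public static void main(String[] args) {
        int hash = -123456789; // hypothetical negative hashCode
        int n = 10;
        System.out.println(hash % n);                // -9: Java % keeps the dividend's sign
        System.out.println((hash & 0x7FFFFFFF) % n); // 9: masking clears the sign bit first
        // Note: Math.abs(hash) is not a safe alternative, because
        // Math.abs(Integer.MIN_VALUE) overflows and stays negative.
    }
}
```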
Custom index-based Partitioner:
Given a vocabulary list W = [w0, w1, ..., wn-1] and numPartitions = |W|:
p = indexOf(key, W)
This assigns each word wi to partition i. If numPartitions equals the vocabulary size, each reducer handles exactly one word. The number of reduce tasks must be set to match the vocabulary size for this scheme to work correctly.
Data skew considerations:
If partition sizes are uneven, some reducers will take longer than others, creating stragglers that delay job completion. A well-designed Partitioner ensures that the total data volume is distributed as evenly as possible across all reduce tasks. The ideal partition function satisfies:
|data(p_i)| approximately equals |data(p_j)| for all partitions i and j
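The skew problem can be made concrete with a small synthetic key set (the keys and counts below are hypothetical). Hash partitioning sends every record of a given key to the same reducer, so a single hot key overloads one partition no matter how good the hash function is:

```java
import java.util.HashMap;
import java.util.Map;

// Counts records per partition under the default hash formula for a
// hypothetical key stream with one hot key.
public class SkewDemo {
    static Map<Integer, Integer> partitionLoad(String[] keys, int numPartitions) {
        Map<Integer, Integer> load = new HashMap<>();
        for (String k : keys) {
            int p = (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
            load.merge(p, 1, Integer::sum); // increment this partition's record count
        }
        return load;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c", "a", "a", "a", "d", "e"}; // "a" is a hot key
        // All four "a" records land on the same reducer, which becomes the straggler.
        System.out.println(partitionLoad(keys, 3));
    }
}
```

Mitigating this kind of single-key skew requires techniques beyond the Partitioner alone, such as salting the hot key on the map side and re-aggregating in a second job.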