Principle:Apache Flink Hadoop Compatibility
| Knowledge Sources | |
|---|---|
| Domains | Hadoop_Compatibility, Legacy_Integration |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Description
The Hadoop Compatibility Layer enables Apache Flink to consume and produce data through Hadoop's InputFormat and OutputFormat APIs. Located in the flink-hadoop-compatibility module, this framework provides wrapper classes for both the legacy org.apache.hadoop.mapred (Hadoop 1.x) and the newer org.apache.hadoop.mapreduce (Hadoop 2.x) API families. It also includes type system integration for Hadoop's Writable serialization framework, utility classes, and MapReduce function wrappers.
The framework is organized into parallel hierarchies for the two Hadoop API generations:
mapred (Legacy API) Wrappers:
- HadoopInputFormatBase -- An
@Internalabstract class extendingHadoopInputFormatCommonBasethat wraps amapred.InputFormat<K, V>. It managesJobConf, createsHadoopInputSplitarrays from the Hadoop format, opens splits viaRecordReader, and applies mutex-based synchronization onopen(),configure(), andclose()to handle Flink's thread-based parallelism (Hadoop assumes JVM-level isolation). - HadoopInputFormat -- Concrete implementation producing Flink
Tuple2<K, V>records. - HadoopOutputFormat / HadoopOutputFormatBase -- Corresponding wrappers for writing through
mapred.OutputFormat. - HadoopInputSplit -- A Flink
LocatableInputSplitwrapper aroundmapred.InputSplit.
mapreduce (New API) Wrappers:
- HadoopInputFormatBase (mapreduce variant) -- Similar base class for
mapreduce.InputFormat, usingJob/TaskAttemptContextinstead ofJobConf. - HadoopInputFormat / HadoopOutputFormat / HadoopOutputFormatBase -- Concrete wrappers for the new API.
- HadoopInputSplit -- Wrapper around
mapreduce.InputSplit.
Type System Integration:
- WritableTypeInfo -- Flink
TypeInformationfor HadoopWritabletypes. - WritableComparator -- Flink
TypeComparatorforWritableComparabletypes. - WritableSerializer -- Flink
TypeSerializerforWritabletypes.
Utility and Function Wrappers:
- HadoopInputs -- Factory methods for creating Hadoop input formats.
- HadoopUtils -- Configuration merging utilities (
mergeHadoopConf). - HadoopMapFunction / HadoopReducerWrappedFunction -- Wrappers allowing Hadoop Mapper and Reducer implementations to run as Flink functions.
Theoretical Basis
The Hadoop Compatibility Layer applies the Adapter pattern systematically to bridge two fundamentally different execution models. Hadoop's MapReduce assumes process-level isolation (one JVM per task), while Flink uses thread-level parallelism (multiple tasks in a single JVM). The HadoopInputFormatBase classes address this mismatch through mutex-based synchronization on lifecycle methods (OPEN_MUTEX, CONFIGURE_MUTEX, CLOSE_MUTEX), enforcing the sequential access semantics that Hadoop InputFormats may implicitly depend upon.
The framework uses custom Java serialization (overriding writeObject/readObject) to handle the non-serializable nature of Hadoop's JobConf and InputFormat instances. During serialization, class names are written as strings and the JobConf is serialized via its Writable interface. During deserialization, classes are dynamically loaded via Thread.currentThread().getContextClassLoader() and credentials are restored from both the serialized state and the current UserGroupInformation.
The parallel hierarchy for mapred and mapreduce packages reflects the two distinct Hadoop API generations without attempting to unify them, since their lifecycle models (push-based RecordReader.next(key, value) vs. pull-based RecordReader.nextKeyValue()) are structurally incompatible. Each hierarchy provides its own HadoopInputSplit wrapper, maintaining type safety and avoiding leaky abstractions.
| API Family | Input Wrapper | Output Wrapper | Split Wrapper |
|---|---|---|---|
mapred (legacy) |
HadoopInputFormatBase / HadoopInputFormat |
HadoopOutputFormatBase / HadoopOutputFormat |
mapred.HadoopInputSplit
|
mapreduce (new) |
HadoopInputFormatBase / HadoopInputFormat |
HadoopOutputFormatBase / HadoopOutputFormat |
mapreduce.HadoopInputSplit
|
Related Pages
- Implementation:Apache_Flink_Mapred_HadoopInputFormat
- Implementation:Apache_Flink_Mapred_HadoopOutputFormat
- Implementation:Apache_Flink_Mapred_HadoopOutputFormatBase
- Implementation:Apache_Flink_Mapred_HadoopInputFormatBase
- Implementation:Apache_Flink_Mapred_HadoopInputSplit
- Implementation:Apache_Flink_Mapreduce_HadoopInputFormat
- Implementation:Apache_Flink_Mapreduce_HadoopOutputFormat
- Implementation:Apache_Flink_Mapreduce_HadoopOutputFormatBase
- Implementation:Apache_Flink_Mapreduce_HadoopInputFormatBase
- Implementation:Apache_Flink_Mapreduce_HadoopInputSplit
- Implementation:Apache_Flink_WritableTypeInfo
- Implementation:Apache_Flink_WritableComparator
- Implementation:Apache_Flink_WritableSerializer
- Implementation:Apache_Flink_HadoopInputs
- Implementation:Apache_Flink_HadoopUtils
- Implementation:Apache_Flink_HadoopMapFunction
- Implementation:Apache_Flink_HadoopReducerWrappedFunction