Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Apache Flink Hadoop Compatibility

From Leeroopedia


Knowledge Sources
Domains Hadoop_Compatibility, Legacy_Integration
Last Updated 2026-02-09 00:00 GMT

Overview

Description

The Hadoop Compatibility Layer enables Apache Flink to consume and produce data through Hadoop's InputFormat and OutputFormat APIs. Located in the flink-hadoop-compatibility module, this framework provides wrapper classes for both the legacy org.apache.hadoop.mapred (Hadoop 1.x) and the newer org.apache.hadoop.mapreduce (Hadoop 2.x) API families. It also includes type system integration for Hadoop's Writable serialization framework, utility classes, and MapReduce function wrappers.

The framework is organized into parallel hierarchies for the two Hadoop API generations:

mapred (Legacy API) Wrappers:

  • HadoopInputFormatBase -- An @Internal abstract class extending HadoopInputFormatCommonBase that wraps a mapred.InputFormat<K, V>. It manages JobConf, creates HadoopInputSplit arrays from the Hadoop format, opens splits via RecordReader, and applies mutex-based synchronization on open(), configure(), and close() to handle Flink's thread-based parallelism (Hadoop assumes JVM-level isolation).
  • HadoopInputFormat -- Concrete implementation producing Flink Tuple2<K, V> records.
  • HadoopOutputFormat / HadoopOutputFormatBase -- Corresponding wrappers for writing through mapred.OutputFormat.
  • HadoopInputSplit -- A Flink LocatableInputSplit wrapper around mapred.InputSplit.

mapreduce (New API) Wrappers:

  • HadoopInputFormatBase (mapreduce variant) -- Similar base class for mapreduce.InputFormat, using Job/TaskAttemptContext instead of JobConf.
  • HadoopInputFormat / HadoopOutputFormat / HadoopOutputFormatBase -- Concrete wrappers for the new API.
  • HadoopInputSplit -- Wrapper around mapreduce.InputSplit.

Type System Integration:

  • WritableTypeInfo -- Flink TypeInformation for Hadoop Writable types.
  • WritableComparator -- Flink TypeComparator for WritableComparable types.
  • WritableSerializer -- Flink TypeSerializer for Writable types.

Utility and Function Wrappers:

  • HadoopInputs -- Factory methods for creating Hadoop input formats.
  • HadoopUtils -- Configuration merging utilities (mergeHadoopConf).
  • HadoopMapFunction / HadoopReducerWrappedFunction -- Wrappers allowing Hadoop Mapper and Reducer implementations to run as Flink functions.

Theoretical Basis

The Hadoop Compatibility Layer applies the Adapter pattern systematically to bridge two fundamentally different execution models. Hadoop's MapReduce assumes process-level isolation (one JVM per task), while Flink uses thread-level parallelism (multiple tasks in a single JVM). The HadoopInputFormatBase classes address this mismatch through mutex-based synchronization on lifecycle methods (OPEN_MUTEX, CONFIGURE_MUTEX, CLOSE_MUTEX), enforcing the sequential access semantics that Hadoop InputFormats may implicitly depend upon.

The framework uses custom Java serialization (overriding writeObject/readObject) to handle the non-serializable nature of Hadoop's JobConf and InputFormat instances. During serialization, class names are written as strings and the JobConf is serialized via its Writable interface. During deserialization, classes are dynamically loaded via Thread.currentThread().getContextClassLoader() and credentials are restored from both the serialized state and the current UserGroupInformation.

The parallel hierarchy for mapred and mapreduce packages reflects the two distinct Hadoop API generations without attempting to unify them, since their lifecycle models (push-based RecordReader.next(key, value) vs. pull-based RecordReader.nextKeyValue()) are structurally incompatible. Each hierarchy provides its own HadoopInputSplit wrapper, maintaining type safety and avoiding leaky abstractions.

API Family Input Wrapper Output Wrapper Split Wrapper
mapred (legacy) HadoopInputFormatBase / HadoopInputFormat HadoopOutputFormatBase / HadoopOutputFormat mapred.HadoopInputSplit
mapreduce (new) HadoopInputFormatBase / HadoopInputFormat HadoopOutputFormatBase / HadoopOutputFormat mapreduce.HadoopInputSplit

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment