# Environment: Apache Flink Hadoop Compatibility

| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Hadoop |
| Last Updated | 2026-02-09 13:00 GMT |
## Overview
Hadoop 2.10.2 runtime environment required for using the flink-hadoop-compatibility module to run Hadoop MapReduce InputFormats and OutputFormats within Apache Flink.
## Description
This environment provides the Hadoop runtime libraries needed to use the `flink-hadoop-compatibility` module. This module wraps Hadoop's MapReduce `InputFormat` and `OutputFormat` interfaces so they can be used as Flink data sources and sinks. The Hadoop version is pinned to 2.10.2 in the project properties. The module requires the `hadoop-common`, `hadoop-hdfs`, and `hadoop-mapreduce-client-core` libraries.
A critical aspect of this environment is that Hadoop assumes JVM-level isolation between tasks (one task per JVM), while Flink uses thread-level parallelism (multiple tasks share a JVM). This architectural difference requires special mutex-based synchronization when using Hadoop InputFormats within Flink.
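The static-mutex pattern this implies can be illustrated in plain Java. The sketch below uses a hypothetical `LegacyFormat` class standing in for a thread-unsafe Hadoop InputFormat; it is not Flink or Hadoop code, just a demonstration of the idea that every instance in the JVM shares one lock per lifecycle method:

```java
// Sketch of the static-mutex pattern: all instances in the JVM share one
// lock per lifecycle phase, serializing calls that thread-unsafe Hadoop
// code expects to run in isolated JVMs. LegacyFormat is a hypothetical
// stand-in, not a real Flink or Hadoop class.
public class MutexDemo {
    static class LegacyFormat {
        // One static mutex, shared by every instance in this JVM.
        private static final Object OPEN_MUTEX = new Object();
        private int openCalls = 0;

        void open() {
            synchronized (OPEN_MUTEX) {
                // Thread-unsafe work runs serialized here.
                openCalls++;
            }
        }

        int openCalls() { return openCalls; }
    }

    public static void main(String[] args) throws InterruptedException {
        LegacyFormat f = new LegacyFormat();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(f::open);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // join establishes happens-before for the read below
        }
        System.out.println(f.openCalls());
    }
}
```

Because the lock is `static`, the serialization holds even across independently created wrapper instances running in different Flink task slots of the same TaskManager JVM.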
## Usage
Use this environment when you need to reuse existing Hadoop InputFormat or OutputFormat implementations within a Flink job. Common use cases include reading from Hadoop-compatible file systems (HDFS, S3 via Hadoop FS) or interoperating with legacy MapReduce pipelines.
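A minimal sketch of such a job, assuming the DataSet API (deprecated in recent Flink releases) and the `mapred`-package `TextInputFormat`; the input path is illustrative:

```java
// Sketch: wrapping a Hadoop TextInputFormat as a Flink source via the
// flink-hadoop-compatibility module. The HDFS path is a placeholder.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;

public class HadoopCompatJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Keys are byte offsets into the file, values are text lines.
        DataSet<Tuple2<LongWritable, Text>> input = env.createInput(
                HadoopInputs.readHadoopFile(
                        new TextInputFormat(), LongWritable.class, Text.class,
                        "hdfs:///path/to/input")); // illustrative path

        input.first(10).print();
    }
}
```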
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), macOS | Hadoop is primarily tested on Linux |
| Hardware | x86_64 CPU | Standard requirements |
| RAM | 4GB minimum | Hadoop client libraries can be memory-intensive |
| Disk | 5GB SSD | For Hadoop client JARs and configuration |
## Dependencies

### System Packages
- Java Development Kit (JDK) 11, 17, or 21
- Maven 3.8.6 (for building)
### Java Dependencies
- `hadoop-common` = 2.10.2
- `hadoop-hdfs` = 2.10.2
- `hadoop-mapreduce-client-core` = 2.10.2
- `hadoop-yarn-common` = 2.10.2
- `hadoop-yarn-client` = 2.10.2
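These could be declared in a consumer's `pom.xml` roughly as follows; the Scala-suffixed `artifactId` is an assumption that should be checked against your Flink release, and only two of the dependencies are shown:

```xml
<!-- Sketch: declaring the compatibility module plus one Hadoop client
     library. The _2.12 suffix and version properties are assumptions. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility_2.12</artifactId>
  <version>${flink.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.10.2</version>
</dependency>
```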
## Credentials
The following environment variables may be needed depending on the Hadoop cluster configuration:
- `HADOOP_HOME`: Path to Hadoop installation directory.
- `HADOOP_CONF_DIR`: Path to Hadoop configuration directory (containing `core-site.xml`, `hdfs-site.xml`).
- Kerberos credentials if the Hadoop cluster uses Kerberos authentication.
Note: During Maven builds, `HADOOP_HOME` and `HADOOP_CONF_DIR` are intentionally set to empty to isolate the build from any external Hadoop environment.
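One way to express that isolation is the Surefire plugin's `environmentVariables` setting; this is a sketch of the idea, not the actual Flink pom content:

```xml
<!-- Sketch: blank out any Hadoop environment inherited from the host so
     tests see a clean, hermetic build. The real Flink pom may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <environmentVariables>
      <HADOOP_HOME></HADOOP_HOME>
      <HADOOP_CONF_DIR></HADOOP_CONF_DIR>
    </environmentVariables>
  </configuration>
</plugin>
```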
## Quick Install

```bash
# Build the Hadoop compatibility module
cd flink
./mvnw clean package -pl flink-connectors/flink-hadoop-compatibility -DskipTests \
    -Dflink.hadoop.version=2.10.2

# To use a different Hadoop version (e.g., 3.x), override:
./mvnw clean package -pl flink-connectors/flink-hadoop-compatibility -DskipTests \
    -Dflink.hadoop.version=3.3.4
```
## Code Evidence

Hadoop version property from `pom.xml:115`:

```xml
<flink.hadoop.version>2.10.2</flink.hadoop.version>
```
Thread-safety concern documented in `HadoopInputFormatBase.java:65-71`:

```java
// Mutexes to avoid concurrent operations on Hadoop InputFormats.
// Hadoop parallelizes tasks across JVMs which is why they might rely on this JVM isolation.
// In contrast, Flink parallelizes using Threads, so multiple Hadoop InputFormat instances
// might be used in the same JVM.
private static final Object OPEN_MUTEX = new Object();
private static final Object CONFIGURE_MUTEX = new Object();
private static final Object CLOSE_MUTEX = new Object();
```
CI build isolation from `flink-connectors/flink-hadoop-compatibility/pom.xml:184-188`:

```xml
<!-- Set HADOOP_HOME and HADOOP_CONF_DIR to empty during Maven builds -->
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NoClassDefFoundError: org/apache/hadoop/mapreduce/InputFormat` | Hadoop JARs not on classpath | Add `hadoop-mapreduce-client-core` dependency |
| `ConcurrentModificationException` in Hadoop InputFormat | Thread-unsafe Hadoop code | Ensure Flink's mutex synchronization is active (use `HadoopInputFormatBase`) |
| `HADOOP_HOME is not set` | Missing Hadoop environment variable | Set `HADOOP_HOME` to Hadoop installation directory |
## Compatibility Notes
- Hadoop 2.x: Default and tested version is 2.10.2. The wrapping layer is designed for the Hadoop 2 API.
- Hadoop 3.x: Can be used by overriding `flink.hadoop.version` in the Maven build; the MapReduce client API is largely backward-compatible, though this combination is outside the pinned default.
- Thread Safety: All Hadoop InputFormat lifecycle methods (open, configure, close) are serialized with static mutexes because Hadoop assumes JVM isolation. See the Hadoop Thread Safety Mutexes heuristic for details.
- Serialization: `HadoopInputFormatBase` uses custom Java serialization (all fields are effectively transient) due to Hadoop Configuration not being natively serializable.
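That hand-rolled serialization pattern can be shown in plain Java: a `transient` field holding a "configuration" object is written and restored manually in `writeObject`/`readObject`. The `ConfigHolder` class below is a hypothetical stand-in, not a Flink class, and `java.util.Properties` plays the role of Hadoop's `Configuration`:

```java
// Sketch of the custom-serialization pattern: the config field is
// transient, so default serialization skips it; writeObject/readObject
// move it through an explicit byte payload instead, mirroring how a
// non-Serializable Hadoop Configuration would be handled.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

public class SerDemo {
    static class ConfigHolder implements Serializable {
        // Skipped by default serialization; restored by readObject below.
        private transient Properties conf;

        ConfigHolder(Properties conf) { this.conf = conf; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            // Hand-serialize the config as a length-prefixed byte payload.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            conf.store(buf, null);
            byte[] bytes = buf.toByteArray();
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            conf = new Properties();
            conf.load(new ByteArrayInputStream(bytes));
        }
    }

    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.setProperty("fs.defaultFS", "hdfs://namenode:8020");

        // Round-trip through Java serialization.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ConfigHolder(p));
        }
        ConfigHolder back;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            back = (ConfigHolder) in.readObject();
        }
        System.out.println(back.conf.getProperty("fs.defaultFS"));
    }
}
```

The length prefix matters: reading the payload back with an exact `readFully` keeps the hand-written bytes from colliding with the object stream's own framing.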