# Environment: Apache Flink Hadoop Compatibility

| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Hadoop |
| Last Updated | 2026-02-09 13:00 GMT |
## Overview
Hadoop 2.10.2 runtime environment required for using the flink-hadoop-compatibility module to run Hadoop MapReduce InputFormats and OutputFormats within Apache Flink.
## Description
This environment provides the Hadoop runtime libraries needed to use the `flink-hadoop-compatibility` module. This module wraps Hadoop's MapReduce `InputFormat` and `OutputFormat` interfaces so they can be used as Flink data sources and sinks. The Hadoop version is pinned to 2.10.2 in the project properties. The module requires the `hadoop-common`, `hadoop-hdfs`, and `hadoop-mapreduce-client-core` libraries.
A critical aspect of this environment is that Hadoop assumes JVM-level isolation between tasks (one task per JVM), while Flink uses thread-level parallelism (multiple tasks share a JVM). This architectural difference requires special mutex-based synchronization when using Hadoop InputFormats within Flink.
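The static-mutex pattern this implies can be illustrated in plain Java. The sketch below uses a hypothetical `LegacyFormat` class standing in for a thread-unsafe Hadoop InputFormat; it is not Flink or Hadoop code, just a demonstration of the idea that every instance in the JVM shares one lock per lifecycle method:

```java
// Sketch of the static-mutex pattern: all instances in the JVM share one
// lock per lifecycle phase, serializing calls that thread-unsafe Hadoop
// code expects to run in isolated JVMs. LegacyFormat is a hypothetical
// stand-in, not a real Flink or Hadoop class.
public class MutexDemo {
    static class LegacyFormat {
        // One static mutex, shared by every instance in this JVM.
        private static final Object OPEN_MUTEX = new Object();
        private int openCalls = 0;

        void open() {
            synchronized (OPEN_MUTEX) {
                // Thread-unsafe work runs serialized here.
                openCalls++;
            }
        }

        int openCalls() { return openCalls; }
    }

    public static void main(String[] args) throws InterruptedException {
        LegacyFormat f = new LegacyFormat();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(f::open);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // join establishes happens-before for the read below
        }
        System.out.println(f.openCalls());
    }
}
```

Because the lock is `static`, the serialization holds even across independently created wrapper instances running in different Flink task slots of the same TaskManager JVM.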
## Usage
Use this environment when you need to reuse existing Hadoop InputFormat or OutputFormat implementations within a Flink job. Common use cases include reading from Hadoop-compatible file systems (HDFS, S3 via Hadoop FS) or interoperating with legacy MapReduce pipelines.
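A minimal sketch of such a job, assuming the DataSet API (deprecated in recent Flink releases) and the `mapred`-package `TextInputFormat`; the input path is illustrative:

```java
// Sketch: wrapping a Hadoop TextInputFormat as a Flink source via the
// flink-hadoop-compatibility module. The HDFS path is a placeholder.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;

public class HadoopCompatJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Keys are byte offsets into the file, values are text lines.
        DataSet<Tuple2<LongWritable, Text>> input = env.createInput(
                HadoopInputs.readHadoopFile(
                        new TextInputFormat(), LongWritable.class, Text.class,
                        "hdfs:///path/to/input")); // illustrative path

        input.first(10).print();
    }
}
```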
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), macOS | Hadoop is primarily tested on Linux |
| Hardware | x86_64 CPU | Standard requirements |
| RAM | 4GB minimum | Hadoop client libraries can be memory-intensive |
| Disk | 5GB SSD | For Hadoop client JARs and configuration |
## Dependencies

### System Packages
- Java Development Kit (JDK) 11, 17, or 21
- Maven 3.8.6 (for building)
### Java Dependencies
- `hadoop-common` = 2.10.2
- `hadoop-hdfs` = 2.10.2
- `hadoop-mapreduce-client-core` = 2.10.2
- `hadoop-yarn-common` = 2.10.2
- `hadoop-yarn-client` = 2.10.2
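These could be declared in a consumer's `pom.xml` roughly as follows; the Scala-suffixed `artifactId` is an assumption that should be checked against your Flink release, and only two of the dependencies are shown:

```xml
<!-- Sketch: declaring the compatibility module plus one Hadoop client
     library. The _2.12 suffix and version properties are assumptions. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility_2.12</artifactId>
  <version>${flink.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.10.2</version>
</dependency>
```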
## Credentials
The following environment variables may be needed depending on the Hadoop cluster configuration:
- `HADOOP_HOME`: Path to Hadoop installation directory.
- `HADOOP_CONF_DIR`: Path to Hadoop configuration directory (containing `core-site.xml`, `hdfs-site.xml`).
- Kerberos credentials if the Hadoop cluster uses Kerberos authentication.
Note: During Maven builds, `HADOOP_HOME` and `HADOOP_CONF_DIR` are intentionally set to empty to isolate the build from any external Hadoop environment.
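One way to express that isolation is the Surefire plugin's `environmentVariables` setting; this is a sketch of the idea, not the actual Flink pom content:

```xml
<!-- Sketch: blank out any Hadoop environment inherited from the host so
     tests see a clean, hermetic build. The real Flink pom may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <environmentVariables>
      <HADOOP_HOME></HADOOP_HOME>
      <HADOOP_CONF_DIR></HADOOP_CONF_DIR>
    </environmentVariables>
  </configuration>
</plugin>
```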
## Quick Install

```bash
# Build the Hadoop compatibility module
cd flink
./mvnw clean package -pl flink-connectors/flink-hadoop-compatibility -DskipTests \
    -Dflink.hadoop.version=2.10.2

# To use a different Hadoop version (e.g., 3.x), override:
./mvnw clean package -pl flink-connectors/flink-hadoop-compatibility -DskipTests \
    -Dflink.hadoop.version=3.3.4
```
## Code Evidence

Hadoop version property from `pom.xml:115`:

```xml
<flink.hadoop.version>2.10.2</flink.hadoop.version>
```
Thread-safety concern documented in `HadoopInputFormatBase.java:65-71`:

```java
// Mutexes to avoid concurrent operations on Hadoop InputFormats.
// Hadoop parallelizes tasks across JVMs which is why they might rely on this JVM isolation.
// In contrast, Flink parallelizes using Threads, so multiple Hadoop InputFormat instances
// might be used in the same JVM.
private static final Object OPEN_MUTEX = new Object();
private static final Object CONFIGURE_MUTEX = new Object();
private static final Object CLOSE_MUTEX = new Object();
```
CI build isolation from `flink-connectors/flink-hadoop-compatibility/pom.xml:184-188`:

```xml
<!-- Set HADOOP_HOME and HADOOP_CONF_DIR to empty during Maven builds -->
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NoClassDefFoundError: org/apache/hadoop/mapreduce/InputFormat` | Hadoop JARs not on classpath | Add `hadoop-mapreduce-client-core` dependency |
| `ConcurrentModificationException` in Hadoop InputFormat | Thread-unsafe Hadoop code | Ensure Flink's mutex synchronization is active (use `HadoopInputFormatBase`) |
| `HADOOP_HOME is not set` | Missing Hadoop environment variable | Set `HADOOP_HOME` to Hadoop installation directory |
## Compatibility Notes
- Hadoop 2.x: Default and tested version is 2.10.2. The wrapping layer is designed for the Hadoop 2 API.
- Hadoop 3.x: Can be used by overriding `flink.hadoop.version` in the Maven build; the MapReduce client API is largely backward-compatible, though this combination is outside the pinned default.
- Thread Safety: All Hadoop InputFormat lifecycle methods (open, configure, close) are serialized with static mutexes because Hadoop assumes JVM isolation. See the Hadoop Thread Safety Mutexes heuristic for details.
- Serialization: `HadoopInputFormatBase` uses custom Java serialization (all fields are effectively transient) due to Hadoop Configuration not being natively serializable.
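That hand-rolled serialization pattern can be shown in plain Java: a `transient` field holding a "configuration" object is written and restored manually in `writeObject`/`readObject`. The `ConfigHolder` class below is a hypothetical stand-in, not a Flink class, and `java.util.Properties` plays the role of Hadoop's `Configuration`:

```java
// Sketch of the custom-serialization pattern: the config field is
// transient, so default serialization skips it; writeObject/readObject
// move it through an explicit byte payload instead, mirroring how a
// non-Serializable Hadoop Configuration would be handled.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

public class SerDemo {
    static class ConfigHolder implements Serializable {
        // Skipped by default serialization; restored by readObject below.
        private transient Properties conf;

        ConfigHolder(Properties conf) { this.conf = conf; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            // Hand-serialize the config as a length-prefixed byte payload.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            conf.store(buf, null);
            byte[] bytes = buf.toByteArray();
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            conf = new Properties();
            conf.load(new ByteArrayInputStream(bytes));
        }
    }

    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.setProperty("fs.defaultFS", "hdfs://namenode:8020");

        // Round-trip through Java serialization.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ConfigHolder(p));
        }
        ConfigHolder back;
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            back = (ConfigHolder) in.readObject();
        }
        System.out.println(back.conf.getProperty("fs.defaultFS"));
    }
}
```

The length prefix matters: reading the payload back with an exact `readFully` keeps the hand-written bytes from colliding with the object stream's own framing.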