
Environment:Apache Flink Hadoop Compatibility Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Hadoop
Last Updated 2026-02-09 13:00 GMT

Overview

The Hadoop 2.10.2 runtime environment required to use the `flink-hadoop-compatibility` module, which runs Hadoop MapReduce InputFormats and OutputFormats within Apache Flink.

Description

This environment provides the Hadoop runtime libraries needed to use the `flink-hadoop-compatibility` module. This module wraps Hadoop's MapReduce `InputFormat` and `OutputFormat` interfaces so they can be used as Flink data sources and sinks. The Hadoop version is pinned at 2.10.2 in the project properties. The module requires hadoop-common, hadoop-hdfs, and hadoop-mapreduce-client-core libraries.

A critical aspect of this environment is that Hadoop assumes JVM-level isolation between tasks (one task per JVM), while Flink uses thread-level parallelism (multiple tasks share a JVM). This architectural difference requires special mutex-based synchronization when using Hadoop InputFormats within Flink.
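The effect of this mismatch can be shown with a stripped-down, self-contained sketch (everything here is hypothetical illustration, not Flink's actual class): several threads in one JVM call a thread-unsafe `open()` routine, serialized by a static mutex in the same way `HadoopInputFormatBase` guards its lifecycle methods.

```java
// Sketch: serializing a thread-unsafe lifecycle method with a static mutex,
// mirroring the OPEN_MUTEX pattern described for HadoopInputFormatBase.
// open() stands in for Hadoop code that assumes one task per JVM.
import java.util.concurrent.atomic.AtomicInteger;

public class MutexDemo {
    private static final Object OPEN_MUTEX = new Object();
    private static final AtomicInteger concurrent = new AtomicInteger();
    static volatile int maxObserved = 0;

    // Pretends to be a Hadoop-style open(): not safe to enter concurrently.
    static void open() throws InterruptedException {
        int now = concurrent.incrementAndGet();
        maxObserved = Math.max(maxObserved, now);
        Thread.sleep(10);                    // widen the race window
        concurrent.decrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        Thread[] tasks = new Thread[8];      // 8 "Flink tasks" in one JVM
        for (int i = 0; i < tasks.length; i++) {
            tasks[i] = new Thread(() -> {
                try {
                    synchronized (OPEN_MUTEX) {  // the Flink-side guard
                        open();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            tasks[i].start();
        }
        for (Thread t : tasks) t.join();
        // With the mutex, at most one thread is ever inside open() at a time.
        System.out.println("max concurrent open() calls: " + maxObserved);
    }
}
```

Removing the `synchronized` block lets several threads overlap inside `open()`, which is exactly the situation Hadoop code never expects.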

Usage

Use this environment when you need to reuse existing Hadoop InputFormat or OutputFormat implementations within a Flink job. Common use cases include reading from Hadoop-compatible file systems (HDFS, S3 via Hadoop FS) or interoperating with legacy MapReduce pipelines.
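As a sketch of what that reuse looks like (assuming Flink's legacy DataSet API; the HDFS path is illustrative, and the Flink and Hadoop JARs must be on the classpath), a Hadoop `TextInputFormat` can be wrapped via `HadoopInputs.readHadoopFile`:

```java
// Sketch only: wraps Hadoop's mapreduce TextInputFormat as a Flink DataSet
// source. Requires flink-hadoop-compatibility plus the Hadoop client JARs;
// the hdfs:// path below is illustrative.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HadoopCompatExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Each record is (byte offset, line), as TextInputFormat produces.
        DataSet<Tuple2<LongWritable, Text>> lines = env.createInput(
                HadoopInputs.readHadoopFile(
                        new TextInputFormat(),
                        LongWritable.class,
                        Text.class,
                        "hdfs:///input/data.txt"));

        lines.print();
    }
}
```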

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (recommended), macOS | Hadoop is primarily tested on Linux |
| Hardware | x86_64 CPU | Standard requirements |
| RAM | 4 GB minimum | Hadoop client libraries can be memory-intensive |
| Disk | 5 GB SSD | For Hadoop client JARs and configuration |

Dependencies

System Packages

  • Java Development Kit (JDK) 11, 17, or 21
  • Maven 3.8.6 (for building)

Java Dependencies

  • `hadoop-common` = 2.10.2
  • `hadoop-hdfs` = 2.10.2
  • `hadoop-mapreduce-client-core` = 2.10.2
  • `hadoop-yarn-common` = 2.10.2
  • `hadoop-yarn-client` = 2.10.2
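In a Maven project, such dependencies are typically declared against a version property, mirroring the `flink.hadoop.version` property the build uses (a sketch; adjust artifacts and scope to your deployment):

```xml
<properties>
    <flink.hadoop.version>2.10.2</flink.hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${flink.hadoop.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${flink.hadoop.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
```

`provided` scope keeps the Hadoop JARs out of the job JAR when the cluster already ships them; use `compile` scope for a self-contained fat JAR.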

Credentials

The following environment variables may be needed depending on the Hadoop cluster configuration:

  • `HADOOP_HOME`: Path to Hadoop installation directory.
  • `HADOOP_CONF_DIR`: Path to Hadoop configuration directory (containing `core-site.xml`, `hdfs-site.xml`).
  • Kerberos credentials if the Hadoop cluster uses Kerberos authentication.

Note: During Maven builds, `HADOOP_HOME` and `HADOOP_CONF_DIR` are intentionally set to empty to isolate the build from any external Hadoop environment.
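For a typical client setup, these variables are exported before submitting the job (paths and principal below are illustrative, not defaults from this environment):

```shell
# Illustrative paths; point these at your actual Hadoop installation.
export HADOOP_HOME=/opt/hadoop-2.10.2
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# Kerberos-secured clusters additionally need a valid ticket:
kinit user@EXAMPLE.COM
```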

Quick Install

# Build the Hadoop compatibility module
cd flink
./mvnw clean package -pl flink-connectors/flink-hadoop-compatibility -DskipTests \
    -Dflink.hadoop.version=2.10.2

# To use a different Hadoop version (e.g., 3.x), override:
./mvnw clean package -pl flink-connectors/flink-hadoop-compatibility -DskipTests \
    -Dflink.hadoop.version=3.3.4

Code Evidence

Hadoop version property from `pom.xml:115`:

<flink.hadoop.version>2.10.2</flink.hadoop.version>

Thread-safety concern documented in `HadoopInputFormatBase.java:65-71`:

// Mutexes to avoid concurrent operations on Hadoop InputFormats.
// Hadoop parallelizes tasks across JVMs which is why they might rely on this JVM isolation.
// In contrast, Flink parallelizes using Threads, so multiple Hadoop InputFormat instances
// might be used in the same JVM.
private static final Object OPEN_MUTEX = new Object();
private static final Object CONFIGURE_MUTEX = new Object();
private static final Object CLOSE_MUTEX = new Object();

CI build isolation from `flink-connectors/flink-hadoop-compatibility/pom.xml:184-188`:

<!-- Set HADOOP_HOME and HADOOP_CONF_DIR to empty during Maven builds -->

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `NoClassDefFoundError: org/apache/hadoop/mapreduce/InputFormat` | Hadoop JARs not on classpath | Add `hadoop-mapreduce-client-core` dependency |
| `ConcurrentModificationException` in Hadoop InputFormat | Thread-unsafe Hadoop code | Ensure Flink's mutex synchronization is active (use `HadoopInputFormatBase`) |
| `HADOOP_HOME is not set` | Missing Hadoop environment variable | Set `HADOOP_HOME` to the Hadoop installation directory |

Compatibility Notes

  • Hadoop 2.x: Default and tested version is 2.10.2. The wrapping layer is designed for the Hadoop 2 API.
  • Hadoop 3.x: Can be used by overriding `flink.hadoop.version` in the Maven build. The MapReduce API is largely backward-compatible with 2.x.
  • Thread Safety: All Hadoop InputFormat lifecycle methods (open, configure, close) are serialized with static mutexes because Hadoop assumes JVM isolation. See the Hadoop Thread Safety Mutexes heuristic for details.
  • Serialization: `HadoopInputFormatBase` uses custom Java serialization (all fields are effectively transient) due to Hadoop Configuration not being natively serializable.
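That serialization pattern can be shown without Hadoop (a minimal, self-contained sketch; `FakeConf` is a hypothetical stand-in for a non-serializable `Configuration`): hold the object in a transient field and write/read its contents by hand in `writeObject`/`readObject`.

```java
// Minimal sketch of the custom-serialization pattern described for
// HadoopInputFormatBase: the wrapped object is not Serializable, so it is
// held in a transient field and serialized by hand. FakeConf is a
// hypothetical stand-in for Hadoop's Configuration.
import java.io.*;

public class CustomSerDemo implements Serializable {
    // Not Serializable, like org.apache.hadoop.conf.Configuration.
    static class FakeConf {
        String value;
        FakeConf(String value) { this.value = value; }
    }

    transient FakeConf conf;   // skipped by default Java serialization

    CustomSerDemo(FakeConf conf) { this.conf = conf; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeUTF(conf.value);              // write the contents by hand
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        conf = new FakeConf(in.readUTF());     // rebuild the transient field
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new CustomSerDemo(
                    new FakeConf("fs.defaultFS=hdfs://nn:8020")));
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            CustomSerDemo copy = (CustomSerDemo) in.readObject();
            System.out.println(copy.conf.value);  // survives the round trip
        }
    }
}
```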
