Environment: Spotify Luigi Hadoop HDFS Cluster
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Big_Data, Distributed_Computing |
| Last Updated | 2026-02-10 07:00 GMT |
Overview
Hadoop cluster environment with HDFS access and MapReduce Streaming support for Luigi pipeline execution.
Description
This environment provides the Hadoop ecosystem dependencies required to run Luigi's Hadoop MapReduce and HDFS contrib modules. It requires a configured Hadoop CLI (the `hadoop` command), access to an HDFS cluster, and optionally the Hadoop Streaming JAR for running MapReduce jobs. Luigi supports both CDH4 (Hadoop 2+) and CDH3/Apache1 variants, with CDH4 as the default. The environment also supports WebHDFS as an alternative to the CLI client.
Usage
Use this environment for any pipeline that reads from or writes to HDFS, or that executes Hadoop MapReduce Streaming jobs. It is required for the Hadoop_MapReduce_Pipeline workflow and any task using `HdfsTarget`, `JobTask`, or `HadoopJarJobTask`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Hadoop CLI requires Linux; macOS possible for development |
| Hadoop | Hadoop 2.x+ (CDH4 default) | CDH3, Apache1 also supported via config |
| Java | JRE/JDK required by Hadoop | Version depends on Hadoop distribution |
| Network | Access to HDFS NameNode | Default port varies by distribution |
| Disk | Varies | Depends on data volume |
Dependencies
System Packages
- `hadoop` CLI binary (must be on PATH or configured via `[hadoop] command`)
- `yarn` CLI (for YARN application management)
- `mapred` CLI (for MapReduce job management)
- Hadoop Streaming JAR (path configured via `[hadoop] streaming-jar`)
Python Packages
- `hdfs` >= 2.0.4, < 3.0.0 (optional, for WebHDFS client)
- `luigi` (core)
Credentials
The following configuration must be set in `luigi.cfg` or equivalent:
- `[hadoop] command`: Path to hadoop binary (default: `hadoop`)
- `[hadoop] version`: Hadoop version variant (default: `cdh4`, options: `cdh3`, `apache1`)
- `[hadoop] streaming-jar`: Path to Hadoop Streaming JAR file
- `[hadoop] python-executable`: Python binary on Hadoop nodes (default: `python`)
- `[hadoop] scheduler`: YARN scheduler type (default: `fair`)
Environment variables:
- `TMPDIR`: Used for temporary files during MapReduce job execution
- `HADOOP_CONF_DIR`: Hadoop configuration directory (optional)
- `HADOOP_USER_NAME`: User identity for HDFS operations (optional)
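Putting the options above together, a minimal `luigi.cfg` might look like this sketch (all paths are illustrative placeholders for your installation):

```ini
[hadoop]
command: hadoop
version: cdh4
streaming-jar: /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
python-executable: python
scheduler: fair
```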
Quick Install
```shell
# Install Luigi with Hadoop/HDFS support
pip install luigi[cdh]

# Or for HDP distributions
pip install luigi[hdp]
```
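After installation, a quick sanity check that the `hadoop` binary is reachable can be sketched in Python (the fallback message is illustrative):

```python
import shutil
import subprocess

# Look up the hadoop binary the way a shell would; returns None if absent.
hadoop = shutil.which('hadoop')

if hadoop is None:
    print('hadoop CLI not found on PATH; set [hadoop] command in luigi.cfg')
else:
    # 'hadoop version' is a cheap local call that does not touch the cluster.
    subprocess.run([hadoop, 'version'], check=True)
```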
Code Evidence
Hadoop CLI configuration from `luigi/contrib/hdfs/config.py:43-66`:
```python
class hadoopcli(luigi.Config):
    command = luigi.Parameter(default="hadoop",
                              config_path=dict(section="hadoop", name="command"),
                              description='The hadoop command, will run split() on it, '
                                          'so you can pass something like "hadoop --param"')
    version = luigi.Parameter(default="cdh4",
                              config_path=dict(section="hadoop", name="version"),
                              description='Can also be cdh3 or apache1')


def load_hadoop_cmd():
    return hadoopcli().command.split()


def get_configured_hadoop_version():
    """
    CDH4 (hadoop 2+) has a slightly different syntax for interacting with hdfs
    via the command line.
    """
    return hadoopcli().version.lower()
```
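Because `load_hadoop_cmd()` simply splits the configured string on whitespace, extra flags can be embedded in `[hadoop] command`. A stand-alone sketch of that behaviour (the flag values are hypothetical):

```python
# Hypothetical configured value of [hadoop] command.
command = 'hadoop --config /etc/hadoop/conf'

# load_hadoop_cmd() does command.split(), yielding an argv prefix...
argv_prefix = command.split()

# ...which callers extend, e.g. to list the HDFS root directory.
arglist = argv_prefix + ['fs', '-ls', '/']
print(arglist)
```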
Streaming JAR usage from `luigi/contrib/hadoop.py:467`:
```python
arglist = luigi.contrib.hdfs.load_hadoop_cmd() + ['jar', self.streaming_jar]
```
YARN/MapReduce job control from `luigi/contrib/hadoop.py:221-224`:
```python
subprocess.call(['yarn', 'application', '-kill', self.application_id])
# ...
subprocess.call(['mapred', 'job', '-kill', self.job_id])
```
Deprecation of core.tmp-dir from `luigi/contrib/hadoop.py:430-433`:
```python
base_tmp_dir = configuration.get_config().get('core', 'tmp-dir', None)
if base_tmp_dir:
    warnings.warn("The core.tmp-dir configuration item is deprecated, "
                  "please use the TMPDIR environment variable...")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FileNotFoundError: hadoop: command not found` | Hadoop CLI not on PATH | Install Hadoop or set `[hadoop] command` in luigi.cfg |
| `HDFSCliError` | HDFS command failed | Check HDFS connectivity and permissions |
| `streaming-jar not configured` | Missing Streaming JAR path | Set `[hadoop] streaming-jar` in luigi.cfg |
| Renaming multiple files not atomic | Known limitation of HDFS rename | See `luigi/contrib/hdfs/hadoopcli_clients.py:96` warning |
Compatibility Notes
- CDH4 vs CDH3: Hadoop 2+ (CDH4) uses different CLI syntax than CDH3/Apache1. Set `[hadoop] version` accordingly.
- WebHDFS: An alternative to the CLI client; requires the `hdfs` Python package and WebHDFS enabled on the cluster.
- HDFS rename: Renaming multiple files at once is not atomic when using the CLI client. This is a known limitation documented in the code.
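Switching from the CLI client to WebHDFS is a configuration change rather than a code change. A sketch, assuming the `[hdfs]` client-selection key and the default WebHDFS port of 50070 (both should be verified against your Luigi and Hadoop versions):

```ini
[hdfs]
client: webhdfs

[webhdfs]
port: 50070
```

This requires the optional `hdfs` Python package listed above and WebHDFS enabled on the NameNode.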