Environment: Spotify Luigi Hadoop HDFS Cluster
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Big_Data, Distributed_Computing |
| Last Updated | 2026-02-10 07:00 GMT |
Overview
Hadoop cluster environment with HDFS access and MapReduce Streaming support for Luigi pipeline execution.
Description
This environment provides the Hadoop ecosystem dependencies required to run Luigi's Hadoop MapReduce and HDFS contrib modules. It requires a configured Hadoop CLI (the `hadoop` command), access to an HDFS cluster, and optionally the Hadoop Streaming JAR for running MapReduce jobs. Luigi supports both CDH4 (Hadoop 2+) and CDH3/Apache1 variants, with CDH4 as the default. The environment also supports WebHDFS as an alternative to the CLI client.
Usage
Use this environment for any pipeline that reads from or writes to HDFS, or that executes Hadoop MapReduce Streaming jobs. It is required for the Hadoop_MapReduce_Pipeline workflow and any task using `HdfsTarget`, `JobTask`, or `HadoopJarJobTask`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Hadoop CLI requires Linux; macOS possible for development |
| Hadoop | Hadoop 2.x+ (CDH4 default) | CDH3, Apache1 also supported via config |
| Java | JRE/JDK required by Hadoop | Version depends on Hadoop distribution |
| Network | Access to HDFS NameNode | Default port varies by distribution |
| Disk | Varies | Depends on data volume |
Dependencies
System Packages
- `hadoop` CLI binary (must be on PATH or configured via `[hadoop] command`)
- `yarn` CLI (for YARN application management)
- `mapred` CLI (for MapReduce job management)
- Hadoop Streaming JAR (path configured via `[hadoop] streaming-jar`)
Python Packages
- `hdfs` >= 2.0.4, < 3.0.0 (optional, for WebHDFS client)
- `luigi` (core)
Credentials
The following configuration must be set in `luigi.cfg` or equivalent:
- `[hadoop] command`: Path to hadoop binary (default: `hadoop`)
- `[hadoop] version`: Hadoop version variant (default: `cdh4`, options: `cdh3`, `apache1`)
- `[hadoop] streaming-jar`: Path to Hadoop Streaming JAR file
- `[hadoop] python-executable`: Python binary on Hadoop nodes (default: `python`)
- `[hadoop] scheduler`: YARN scheduler type (default: `fair`)
Environment variables:
- `TMPDIR`: Used for temporary files during MapReduce job execution
- `HADOOP_CONF_DIR`: Hadoop configuration directory (optional)
- `HADOOP_USER_NAME`: User identity for HDFS operations (optional)
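Putting the options above together, a minimal `luigi.cfg` might look like this sketch (all paths are illustrative placeholders for your installation):

```ini
[hadoop]
command: hadoop
version: cdh4
streaming-jar: /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
python-executable: python
scheduler: fair
```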
Quick Install
```shell
# Install Luigi with Hadoop/HDFS support
pip install luigi[cdh]

# Or for HDP distributions
pip install luigi[hdp]
```
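After installation, a quick sanity check that the `hadoop` binary is reachable can be sketched in Python (the fallback message is illustrative):

```python
import shutil
import subprocess

# Look up the hadoop binary the way a shell would; returns None if absent.
hadoop = shutil.which('hadoop')

if hadoop is None:
    print('hadoop CLI not found on PATH; set [hadoop] command in luigi.cfg')
else:
    # 'hadoop version' is a cheap local call that does not touch the cluster.
    subprocess.run([hadoop, 'version'], check=True)
```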
Code Evidence
Hadoop CLI configuration from `luigi/contrib/hdfs/config.py:43-66`:
```python
class hadoopcli(luigi.Config):
    command = luigi.Parameter(default="hadoop",
                              config_path=dict(section="hadoop", name="command"),
                              description='The hadoop command, will run split() on it, '
                                          'so you can pass something like "hadoop --param"')
    version = luigi.Parameter(default="cdh4",
                              config_path=dict(section="hadoop", name="version"),
                              description='Can also be cdh3 or apache1')


def load_hadoop_cmd():
    return hadoopcli().command.split()


def get_configured_hadoop_version():
    """
    CDH4 (hadoop 2+) has a slightly different syntax for interacting with hdfs
    via the command line.
    """
    return hadoopcli().version.lower()
```
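Because `load_hadoop_cmd()` simply splits the configured string on whitespace, extra flags can be embedded in `[hadoop] command`. A stand-alone sketch of that behaviour (the flag values are hypothetical):

```python
# Hypothetical configured value of [hadoop] command.
command = 'hadoop --config /etc/hadoop/conf'

# load_hadoop_cmd() does command.split(), yielding an argv prefix...
argv_prefix = command.split()

# ...which callers extend, e.g. to list the HDFS root directory.
arglist = argv_prefix + ['fs', '-ls', '/']
print(arglist)
```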
Streaming JAR usage from `luigi/contrib/hadoop.py:467`:
```python
arglist = luigi.contrib.hdfs.load_hadoop_cmd() + ['jar', self.streaming_jar]
```
YARN/MapReduce job control from `luigi/contrib/hadoop.py:221-224`:
```python
subprocess.call(['yarn', 'application', '-kill', self.application_id])
# ...
subprocess.call(['mapred', 'job', '-kill', self.job_id])
```
Deprecation of core.tmp-dir from `luigi/contrib/hadoop.py:430-433`:
```python
base_tmp_dir = configuration.get_config().get('core', 'tmp-dir', None)
if base_tmp_dir:
    warnings.warn("The core.tmp-dir configuration item is deprecated, "
                  "please use the TMPDIR environment variable...")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FileNotFoundError: hadoop: command not found` | Hadoop CLI not on PATH | Install Hadoop or set `[hadoop] command` in luigi.cfg |
| `HDFSCliError` | HDFS command failed | Check HDFS connectivity and permissions |
| `streaming-jar not configured` | Missing Streaming JAR path | Set `[hadoop] streaming-jar` in luigi.cfg |
| Renaming multiple files not atomic | Known limitation of HDFS rename | See `luigi/contrib/hdfs/hadoopcli_clients.py:96` warning |
Compatibility Notes
- CDH4 vs CDH3: Hadoop 2+ (CDH4) uses different CLI syntax than CDH3/Apache1. Set `[hadoop] version` accordingly.
- WebHDFS: An alternative to the CLI client; requires the `hdfs` Python package and WebHDFS enabled on the cluster.
- HDFS rename: Renaming multiple files at once is not atomic when using the CLI client. This is a known limitation documented in the code.
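Switching from the CLI client to WebHDFS is a configuration change rather than a code change. A sketch, assuming the `[hdfs]` client-selection key and the default WebHDFS port of 50070 (both should be verified against your Luigi and Hadoop versions):

```ini
[hdfs]
client: webhdfs

[webhdfs]
port: 50070
```

This requires the optional `hdfs` Python package listed above and WebHDFS enabled on the NameNode.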