
Environment:Spotify Luigi Hadoop HDFS Cluster

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Big_Data, Distributed_Computing
Last Updated: 2026-02-10 07:00 GMT

Overview

Hadoop cluster environment with HDFS access and MapReduce Streaming support for Luigi pipeline execution.

Description

This environment provides the Hadoop ecosystem dependencies required to run Luigi's Hadoop MapReduce and HDFS contrib modules. It requires a configured Hadoop CLI (the `hadoop` command), access to an HDFS cluster, and optionally the Hadoop Streaming JAR for running MapReduce jobs. Luigi supports both CDH4 (Hadoop 2+) and CDH3/Apache1 variants, with CDH4 as the default. The environment also supports WebHDFS as an alternative to the CLI client.

Usage

Use this environment for any pipeline that reads from or writes to HDFS, or that executes Hadoop MapReduce Streaming jobs. It is required for the Hadoop_MapReduce_Pipeline workflow and any task using `HdfsTarget`, `JobTask`, or `HadoopJarJobTask`.

System Requirements

Category | Requirement | Notes
OS | Linux | Hadoop CLI requires Linux; macOS possible for development
Hadoop | Hadoop 2.x+ (CDH4 default) | CDH3, Apache1 also supported via config
Java | JRE/JDK required by Hadoop | Version depends on Hadoop distribution
Network | Access to HDFS NameNode | Default port varies by distribution
Disk | Varies | Depends on data volume

Dependencies

System Packages

  • `hadoop` CLI binary (must be on PATH or configured via `[hadoop] command`)
  • `yarn` CLI (for YARN application management)
  • `mapred` CLI (for MapReduce job management)
  • Hadoop Streaming JAR (path configured via `[hadoop] streaming-jar`)

Python Packages

  • `hdfs` >= 2.0.4, < 3.0.0 (optional, for WebHDFS client)
  • `luigi` (core)

Credentials

The following configuration must be set in `luigi.cfg` or equivalent:

  • `[hadoop] command`: Path to hadoop binary (default: `hadoop`)
  • `[hadoop] version`: Hadoop version variant (default: `cdh4`, options: `cdh3`, `apache1`)
  • `[hadoop] streaming-jar`: Path to Hadoop Streaming JAR file
  • `[hadoop] python-executable`: Python binary on Hadoop nodes (default: `python`)
  • `[hadoop] scheduler`: YARN scheduler type (default: `fair`)
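Putting these options together, a minimal `luigi.cfg` for a CDH4-style cluster might look like the following (the streaming JAR path is illustrative and varies by distribution):

```ini
[hadoop]
command: hadoop
version: cdh4
streaming-jar: /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
python-executable: python
scheduler: fair
```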

Environment variables:

  • `TMPDIR`: Used for temporary files during MapReduce job execution
  • `HADOOP_CONF_DIR`: Hadoop configuration directory (optional)
  • `HADOOP_USER_NAME`: User identity for HDFS operations (optional)
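An invocation that pins these variables might look like the sketch below. The module and task names are hypothetical, and the `luigi` launch itself is shown as a comment since it needs a live cluster:

```shell
# Point Luigi's Hadoop client at the cluster config and identity.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_USER_NAME=etl
# Scratch space for job packaging (replaces the deprecated core.tmp-dir).
export TMPDIR=/var/tmp/luigi

# Hypothetical pipeline launch:
# luigi --module my_pipeline CountLines --date 2024-01-02
```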

Quick Install

# Install Luigi with Hadoop/HDFS support
pip install luigi[cdh]

# Or for HDP distributions
pip install luigi[hdp]

Code Evidence

Hadoop CLI configuration from `luigi/contrib/hdfs/config.py:43-66`:

class hadoopcli(luigi.Config):
    command = luigi.Parameter(default="hadoop",
                              config_path=dict(section="hadoop", name="command"),
                              description='The hadoop command, will run split() on it, '
                                          'so you can pass something like "hadoop --param"')
    version = luigi.Parameter(default="cdh4",
                              config_path=dict(section="hadoop", name="version"),
                              description='Can also be cdh3 or apache1')

def load_hadoop_cmd():
    return hadoopcli().command.split()

def get_configured_hadoop_version():
    """
    CDH4 (hadoop 2+) has a slightly different syntax for interacting with hdfs
    via the command line.
    """
    return hadoopcli().version.lower()
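Because `load_hadoop_cmd` calls `split()` on the configured command, extra flags can ride along in a single config value; the behavior is plain whitespace splitting:

```python
# Mirrors what load_hadoop_cmd() does with the configured [hadoop] command.
command = "hadoop --config /etc/hadoop/conf"
arglist = command.split()
# arglist == ['hadoop', '--config', '/etc/hadoop/conf'] and can be extended
# with subcommands, e.g. arglist + ['fs', '-ls', '/']
```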

Streaming JAR usage from `luigi/contrib/hadoop.py:467`:

arglist = luigi.contrib.hdfs.load_hadoop_cmd() + ['jar', self.streaming_jar]

YARN/MapReduce job control from `luigi/contrib/hadoop.py:221-224`:

subprocess.call(['yarn', 'application', '-kill', self.application_id])
# ...
subprocess.call(['mapred', 'job', '-kill', self.job_id])

Deprecation of core.tmp-dir from `luigi/contrib/hadoop.py:430-433`:

base_tmp_dir = configuration.get_config().get('core', 'tmp-dir', None)
if base_tmp_dir:
    warnings.warn("The core.tmp-dir configuration item is deprecated, "
                  "please use the TMPDIR environment variable...")

Common Errors

Error Message | Cause | Solution
`FileNotFoundError: hadoop: command not found` | Hadoop CLI not on PATH | Install Hadoop or set `[hadoop] command` in luigi.cfg
`HDFSCliError` | HDFS command failed | Check HDFS connectivity and permissions
`streaming-jar not configured` | Missing Streaming JAR path | Set `[hadoop] streaming-jar` in luigi.cfg
Renaming multiple files not atomic | Known limitation of HDFS rename | See `luigi/contrib/hdfs/hadoopcli_clients.py:96` warning

Compatibility Notes

  • CDH4 vs CDH3: Hadoop 2+ (CDH4) uses different CLI syntax than CDH3/Apache1. Set `[hadoop] version` accordingly.
  • WebHDFS: An alternative to the CLI client; requires the `hdfs` Python package and WebHDFS enabled on the cluster.
  • HDFS rename: Renaming multiple files at once is not atomic when using the CLI client. This is a known limitation documented in the code.
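To switch from the CLI client to WebHDFS, the client is selected in configuration. The section and option names below follow Luigi's `hdfs`/`webhdfs` config classes, but treat the exact keys and the port as assumptions to verify against your Luigi version and cluster:

```ini
[hdfs]
client = webhdfs

[webhdfs]
port = 50070
user = etl
```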
