Implementation:Spotify Luigi HDFS Config

From Leeroopedia



Domains: Pipeline_Orchestration, Big_Data

Last Updated: 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by Luigi, for configuring the Hadoop environment used by HDFS and MapReduce operations.

Description

The luigi.contrib.hdfs.config module defines two Luigi Config classes and several helper functions that centralize Hadoop and HDFS settings. The hdfs config class holds cluster-level settings: the HDFS client type, the NameNode host and port, and the temporary directory path. The hadoopcli config class holds the Hadoop CLI command string and the distribution version (cdh3, cdh4, or apache1). The helper functions load_hadoop_cmd(), get_configured_hadoop_version(), and get_configured_hdfs_client() return these configuration values, and tmppath() generates unique temporary paths on HDFS.

Usage

Use this module when:

  • You need to obtain the Hadoop command as a list suitable for subprocess calls.
  • You need to determine which HDFS client to instantiate (hadoopcli, snakebite, webhdfs).
  • You need to generate a unique temporary path on HDFS for intermediate pipeline data.
  • You are writing a custom job runner and need to read the Hadoop version or NameNode address from configuration.
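As a rough illustration of the last use case, the sketch below branches on a version string of the kind get_configured_hadoop_version() returns. The function recursive_remove_cmd is hypothetical and not part of Luigi, and the mapping of versions to flags (-rmr for the older cdh3/apache1 syntax, -rm -r otherwise) is an assumption made for the example rather than a statement of what Luigi's own clients do.

```python
def recursive_remove_cmd(hadoop_cmd, version, path):
    """Build a recursive-delete command for a given Hadoop CLI flavour.

    hadoop_cmd: a list such as the one load_hadoop_cmd() returns.
    version:    a lowercase string such as get_configured_hadoop_version() returns.
    This helper is hypothetical; it only illustrates version-dependent syntax.
    """
    if version in ("cdh3", "apache1"):
        # Assumed older syntax: a single -rmr flag.
        return hadoop_cmd + ["fs", "-rmr", path]
    # Assumed newer syntax (the cdh4 default): -rm followed by -r.
    return hadoop_cmd + ["fs", "-rm", "-r", path]


print(recursive_remove_cmd(["hadoop"], "cdh4", "/tmp/scratch"))
print(recursive_remove_cmd(["hadoop"], "cdh3", "/tmp/scratch"))
```

In a real job runner the first two arguments would come from load_hadoop_cmd() and get_configured_hadoop_version() instead of literals.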

Code Reference

Source Location

luigi/contrib/hdfs/config.py, lines 32-121.

Key Signatures

class hdfs(luigi.Config):
    client_version = luigi.IntParameter(default=None)
    namenode_host = luigi.OptionalParameter(default=None)
    namenode_port = luigi.IntParameter(default=None)
    client = luigi.Parameter(default='hadoopcli')
    tmp_dir = luigi.OptionalParameter(default=None,
        config_path=dict(section='core', name='hdfs-tmp-dir'))

class hadoopcli(luigi.Config):
    command = luigi.Parameter(default="hadoop",
        config_path=dict(section="hadoop", name="command"))
    version = luigi.Parameter(default="cdh4",
        config_path=dict(section="hadoop", name="version"))

def load_hadoop_cmd() -> list:
    ...

def get_configured_hadoop_version() -> str:
    ...

def get_configured_hdfs_client() -> str:
    ...

def tmppath(path=None, include_unix_username=True) -> str:
    ...

Import

from luigi.contrib.hdfs.config import hdfs, hadoopcli, load_hadoop_cmd, tmppath
# or via the hdfs package:
import luigi.contrib.hdfs
luigi.contrib.hdfs.load_hadoop_cmd()

I/O Contract

Inputs

Name | Type | Description
[hdfs] client | str (config) | HDFS client backend to use: hadoopcli, snakebite, or webhdfs
[hdfs] namenode_host | str (config) | Hostname of the HDFS NameNode
[hdfs] namenode_port | int (config) | Port of the HDFS NameNode
[hdfs] tmp_dir | str (config) | HDFS temporary directory override (also read from hdfs-tmp-dir in the [core] section)
[hadoop] command | str (config) | Hadoop CLI command string (e.g., "hadoop" or "hadoop --config /etc/hadoop")
[hadoop] version | str (config) | Hadoop distribution version: cdh4, cdh3, or apache1
path (tmppath) | str or None | Optional target path for which to generate a colocated temporary path
include_unix_username (tmppath) | bool | Whether to include the current username in the generated path (default True)

Outputs

Name | Type | Description
load_hadoop_cmd() | list[str] | The Hadoop command split into a list suitable for subprocess calls
get_configured_hadoop_version() | str | Lowercase Hadoop version string
get_configured_hdfs_client() | str | Configured HDFS client name
tmppath() | str | A unique temporary HDFS path incorporating a random suffix and, optionally, the username

Usage Examples

Example 1: Configuring Hadoop via luigi.cfg

Place the following in /etc/luigi/luigi.cfg or in a luigi.cfg in the working directory (client.cfg is the older, legacy file name):

[hdfs]
client = hadoopcli
namenode_host = namenode.example.com
namenode_port = 8020

[hadoop]
command = /usr/bin/hadoop
version = cdh4
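To make the section-and-key mapping concrete, here is a minimal sketch that parses the same settings with Python's standard configparser. It does not use Luigi at all (Luigi resolves these values through its own Config classes); the variable names and the in-memory config string are assumptions for illustration only.

```python
import configparser

# The same settings as in the example above, held in a string for the sketch.
CFG_TEXT = """
[hdfs]
client = hadoopcli
namenode_host = namenode.example.com
namenode_port = 8020

[hadoop]
command = /usr/bin/hadoop
version = cdh4
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

client = parser.get("hdfs", "client")                    # str
namenode_port = parser.getint("hdfs", "namenode_port")   # parsed as int
hadoop_cmd = parser.get("hadoop", "command").split()     # command string -> list
version = parser.get("hadoop", "version").lower()        # normalised to lowercase
# tmp_dir is mapped to [core] hdfs-tmp-dir; it is absent here, so None is used.
tmp_dir = parser.get("core", "hdfs-tmp-dir", fallback=None)

print(client, namenode_port, hadoop_cmd, version, tmp_dir)
```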

Example 2: Loading the Hadoop command in code

import subprocess

from luigi.contrib.hdfs.config import load_hadoop_cmd

# Returns e.g. ['/usr/bin/hadoop'] for the configuration in Example 1
cmd = load_hadoop_cmd()
# Append the HDFS subcommand and its arguments to build a subprocess call:
subprocess.call(cmd + ['fs', '-ls', '/user/data/'])
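The reason the command comes back as a list is easier to see with a multi-word setting such as "hadoop --config /etc/hadoop". The sketch below assumes plain whitespace splitting with str.split() and runs without Luigi or Hadoop installed; the variable names are illustrative.

```python
# A multi-word value of the kind the [hadoop] command setting may hold.
command_setting = "hadoop --config /etc/hadoop"

# Assumed behaviour: split on whitespace so each token is its own argv entry.
hadoop_cmd = command_setting.split()

# Appending arguments then yields a list that subprocess can run without a shell.
ls_cmd = hadoop_cmd + ["fs", "-ls", "/user/data/"]
print(ls_cmd)
```

Plain whitespace splitting does not honour quotes, so under this assumption a command path or argument containing spaces would be broken into separate tokens.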

Example 3: Generating a temporary HDFS path

from luigi.contrib.hdfs.config import tmppath

# Generate a temp path colocated with the target:
temp = tmppath('/data/output/2026-02-10')
# Result (with no tmp_dir configured and Unix user 'myuser'): something like
# '/tmp/myuser/data/output/2026-02-10-luigitemp-482937152'

# Generate an anonymous temp path:
anon_temp = tmppath()
# Result: something like '/tmp/myuser/luigitemp-719204831'
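
For readers who want to see how such a path could be assembled, the following is a simplified, standard-library approximation. It is not Luigi's implementation: sketch_tmppath and its username and configured_tmp_dir parameters are hypothetical, and the rules it encodes (a configured directory takes precedence, the target's scheme and host are kept, a nine-digit random suffix is appended) are assumptions chosen to reproduce the shapes shown above.

```python
import posixpath
import random
from urllib.parse import urlparse, urlunparse


def sketch_tmppath(path=None, username="myuser", configured_tmp_dir=None):
    """Approximate a colocated temporary HDFS path (illustrative only)."""
    suffix = "luigitemp-%09d" % random.randrange(0, 10**9)
    default_tmp = "/tmp"

    # 1. Choose the base directory.
    if configured_tmp_dir is not None:
        base_dir = configured_tmp_dir          # assumed: configuration wins
    elif path is not None:
        parsed = urlparse(path)                # keep scheme and host of the target
        base_dir = urlunparse((parsed.scheme, parsed.netloc, default_tmp, "", "", ""))
    else:
        base_dir = default_tmp

    # 2. Mirror the target's path below the base directory.
    subdir = urlparse(path).path.lstrip("/") + "-" if path is not None else ""
    if username:
        subdir = posixpath.join(username, subdir)

    return posixpath.join(base_dir, subdir + suffix)


print(sketch_tmppath("/data/output/2026-02-10"))
print(sketch_tmppath("hdfs://namenode.example.com:8020/data/out"))
print(sketch_tmppath())
```

Keeping the scheme and host means the temporary path lands on the same filesystem as the target, which is what makes a later rename into place cheap.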
