Implementation:Spotify Luigi HDFS Config
Domains: Pipeline_Orchestration, Big_Data
Last Updated: 2026-02-10 00:00 GMT
Overview
Concrete tool, provided by Luigi, for configuring the Hadoop environment used by HDFS and MapReduce operations.
Description
The luigi.contrib.hdfs.config module defines two Luigi Config classes and several helper functions that centralize Hadoop and HDFS settings. The hdfs config class holds cluster-level settings such as the HDFS client type, NameNode host and port, and the temporary directory path. The hadoopcli config class holds the Hadoop CLI command string and the distribution version (cdh4, cdh3, or apache1). The helper functions load_hadoop_cmd(), get_configured_hadoop_version(), and get_configured_hdfs_client() return these configuration values, and tmppath() generates unique temporary paths on HDFS.
Usage
Use this module when:
- You need to obtain the Hadoop command as a list suitable for subprocess calls.
- You need to determine which HDFS client to instantiate (hadoopcli, snakebite, webhdfs).
- You need to generate a unique temporary path on HDFS for intermediate pipeline data.
- You are writing a custom job runner and need to read the Hadoop version or NameNode address from configuration.
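For the client-selection case above, a custom runner typically branches on the configured client name. The following is a minimal sketch that assumes only the three client names listed above; `describe_client`, `KNOWN_CLIENTS`, and the description strings are hypothetical and not part of Luigi. In a real pipeline the name would come from get_configured_hdfs_client() rather than a literal.

```python
# Hypothetical lookup table: client name -> what a runner would do with it.
# In real code the name would come from get_configured_hdfs_client().
KNOWN_CLIENTS = {
    "hadoopcli": "shell out to the Hadoop CLI",
    "snakebite": "use the snakebite library",
    "webhdfs": "talk to the WebHDFS REST endpoint",
}


def describe_client(name: str) -> str:
    """Return a description for a known client name, or fail loudly."""
    try:
        return KNOWN_CLIENTS[name]
    except KeyError:
        raise ValueError(f"unknown HDFS client: {name!r}") from None


print(describe_client("hadoopcli"))
```

Failing on an unknown name, rather than falling back silently, makes a typo in the configuration visible at startup.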
Code Reference
Source Location
luigi/contrib/hdfs/config.py, lines 32-121.
Key Signatures
class hdfs(luigi.Config):
    client_version = luigi.IntParameter(default=None)
    namenode_host = luigi.OptionalParameter(default=None)
    namenode_port = luigi.IntParameter(default=None)
    client = luigi.Parameter(default='hadoopcli')
    tmp_dir = luigi.OptionalParameter(default=None,
                                      config_path=dict(section='core', name='hdfs-tmp-dir'))

class hadoopcli(luigi.Config):
    command = luigi.Parameter(default="hadoop",
                              config_path=dict(section="hadoop", name="command"))
    version = luigi.Parameter(default="cdh4",
                              config_path=dict(section="hadoop", name="version"))

def load_hadoop_cmd() -> list:
    ...

def get_configured_hadoop_version() -> str:
    ...

def get_configured_hdfs_client() -> str:
    ...

def tmppath(path=None, include_unix_username=True) -> str:
    ...
Import
from luigi.contrib.hdfs.config import hdfs, hadoopcli, load_hadoop_cmd, tmppath
# or via the hdfs package:
import luigi.contrib.hdfs
luigi.contrib.hdfs.load_hadoop_cmd()
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| [hdfs] client | str (config) | HDFS client backend to use: hadoopcli, snakebite, or webhdfs |
| [hdfs] namenode_host | str (config) | Hostname of the HDFS NameNode |
| [hdfs] namenode_port | int (config) | Port of the HDFS NameNode |
| [hdfs] tmp_dir | str (config) | HDFS temporary directory override (also reads core.hdfs-tmp-dir) |
| [hadoop] command | str (config) | Hadoop CLI command string (e.g., "hadoop" or "hadoop --config /etc/hadoop") |
| [hadoop] version | str (config) | Hadoop distribution version: cdh4, cdh3, or apache1 |
| path (tmppath) | str or None | Optional target path for which to generate a colocated temporary path |
Outputs
| Name | Type | Description |
|---|---|---|
| load_hadoop_cmd() | list[str] | The Hadoop command split into a list suitable for subprocess calls |
| get_configured_hadoop_version() | str | Lowercase Hadoop version string |
| get_configured_hdfs_client() | str | Configured HDFS client name |
| tmppath() | str | A unique temporary HDFS path incorporating random suffix and optional username |
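The first two outputs are simple transformations of the [hadoop] settings. As a rough illustration that runs without Luigi installed, the sketch below mimics them with plain string operations. `split_command` and `normalize_version` are hypothetical stand-ins, and whitespace splitting is an assumption drawn from the table's description, not a copy of the library code.

```python
def split_command(command: str) -> list:
    # Whitespace split, so extra CLI options become separate argv entries.
    return command.split()


def normalize_version(version: str) -> str:
    # Lowercase so that "CDH4" and "cdh4" compare equal downstream.
    return version.lower()


cmd = split_command("hadoop --config /etc/hadoop")
# Append the HDFS subcommand the same way a subprocess call would.
print(cmd + ["fs", "-ls", "/user/data/"])
print(normalize_version("CDH4"))
```

Returning a list rather than a string lets callers append arguments and pass the result to subprocess without shell quoting.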
Usage Examples
Example 1: Configuring Hadoop via luigi.cfg
Place the following in /etc/luigi/client.cfg or luigi.cfg:
[hdfs]
client = hadoopcli
namenode_host = namenode.example.com
namenode_port = 8020
[hadoop]
command = /usr/bin/hadoop
version = cdh4
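To sanity-check a file like the one above before handing it to Luigi, it can be parsed with the standard library. This sketch runs Python's configparser on an inline copy of the example. It only illustrates the section and key layout and does not go through Luigi's own configuration loader; `EXAMPLE_CFG` is a name introduced here for illustration.

```python
import configparser

# Inline copy of the example configuration shown above.
EXAMPLE_CFG = """
[hdfs]
client = hadoopcli
namenode_host = namenode.example.com
namenode_port = 8020

[hadoop]
command = /usr/bin/hadoop
version = cdh4
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CFG)

# namenode_port is declared as an integer parameter, so read it as one.
port = parser.getint("hdfs", "namenode_port")
host = parser.get("hdfs", "namenode_host")
print(f"NameNode at {host}:{port}")
print(parser.get("hadoop", "command").split())
```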
Example 2: Loading the Hadoop command in code
import subprocess

from luigi.contrib.hdfs.config import load_hadoop_cmd

# Returns e.g. ['/usr/bin/hadoop']
cmd = load_hadoop_cmd()

# Use it to build a subprocess call:
subprocess.call(cmd + ['fs', '-ls', '/user/data/'])
Example 3: Generating a temporary HDFS path
from luigi.contrib.hdfs.config import tmppath
# Generate a temp path colocated with the target:
temp = tmppath('/data/output/2026-02-10')
# Result: something like '/tmp/myuser/data/output/2026-02-10-luigitemp-482937152'
# Generate an anonymous temp path:
anon_temp = tmppath()
# Result: something like '/tmp/myuser/luigitemp-719204831'
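The shape of these results can be reproduced with a small stand-alone function. The sketch below is an approximation written for illustration, not Luigi's implementation. `sketch_tmppath`, its `username` and `base_dir` parameters, and the nine-digit suffix format are assumptions inferred from the example outputs above, and the function ignores the tmp_dir setting and any URL scheme in the path.

```python
import posixpath
import random


def sketch_tmppath(path=None, username="myuser", base_dir="/tmp"):
    # Random suffix so concurrent runs do not collide (format is an assumption).
    addon = "luigitemp-%09d" % random.randrange(0, 10**9)
    # Mirror the target's directory layout under the temp dir, if a target is given.
    subdir = path.lstrip("/") + "-" if path is not None else ""
    if username:
        # Namespace temp files per user, like include_unix_username=True.
        subdir = posixpath.join(username, subdir)
    return posixpath.join(base_dir, subdir + addon)


print(sketch_tmppath("/data/output/2026-02-10"))
print(sketch_tmppath())
```

Taking the username as a parameter keeps the sketch deterministic apart from the random suffix, which makes the path layout easy to check in a test.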