
Principle:Spotify Luigi HDFS Data Sources

From Leeroopedia


Template:Knowledge Source

Domains: Pipeline_Orchestration, Big_Data

Last Updated: 2026-02-10 00:00 GMT

Overview

HDFS Data Sources is the practice of declaring input and output locations on a Hadoop Distributed File System so that pipeline tasks can verify existence, read content, and write results in a reliable and idempotent manner.

Description

In a distributed data pipeline, every processing step must clearly declare where its input data resides and where its output data will be written. When the storage layer is HDFS, these declarations carry additional concerns beyond simple file paths:

  • Existence checking -- Before a task runs, the orchestrator checks whether the output already exists on HDFS, enabling idempotent re-runs. If the output is present, the task is skipped.
  • Atomic writes -- Writing directly to the final output path risks leaving partial data if a job fails. A common pattern is to write to a temporary location and atomically move the result into place upon success.
  • Directory-based outputs -- MapReduce jobs produce output as a directory of part files rather than a single file. A special "flag file" (typically _SUCCESS) inside the directory signals that all part files are complete.
  • Format negotiation -- Data on HDFS may be stored in plain text, compressed formats, or custom encodings. The data source declaration should pair a path with a format specification so that readers and writers use the correct codec.
  • Temporary data lifecycle -- Intermediate datasets created during multi-step pipelines should be placed in designated temporary locations and cleaned up after downstream tasks consume them.

By treating HDFS locations as first-class, typed objects rather than bare strings, the pipeline gains compile-time-like guarantees about path validity and format consistency.
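The pairing of a path with a composable format can be sketched in plain Python. This is an illustrative stand-in, not Luigi's API (Luigi's actual class is `HdfsTarget`; see the Implementation page): the class and format names below are hypothetical, the local filesystem stands in for HDFS, and `>>` chains formats in the Decorator style described under "Format negotiation".

```python
import os


class ChainableFormat:
    """Illustrative composable format (Decorator pattern).

    `a >> b` applies a's encoding, then b's, keeping format logic
    orthogonal to path logic. These names are stand-ins, not Luigi's.
    """

    def __init__(self, name, encode):
        self.name = name
        self.encode = encode

    def __rshift__(self, other):
        # Compose: pipe this format's output into the next format.
        return ChainableFormat(
            self.name + ">>" + other.name,
            lambda data: other.encode(self.encode(data)),
        )


class HdfsLocation:
    """A typed data source: a path paired with a format, not a bare string."""

    def __init__(self, path, fmt):
        self.path = path
        self.fmt = fmt

    def exists(self):
        # A real target would query the NameNode; the local
        # filesystem stands in for HDFS in this sketch.
        return os.path.exists(self.path)


# An in-memory text format piped into a byte-level transport format.
UTF8 = ChainableFormat("utf8", lambda s: s.encode("utf-8"))
Plain = ChainableFormat("plain", lambda b: b)

target = HdfsLocation("/data/out/part-00000", UTF8 >> Plain)
```

Because the location object carries both path and format, every reader and writer that receives it is guaranteed to use a consistent codec, which is the "compile-time-like" guarantee described above.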

Usage

Use HDFS Data Sources when:

  • Defining the output of a MapReduce or Spark job that writes to HDFS.
  • Declaring an external dataset on HDFS as an input dependency for a pipeline task.
  • Building multi-step pipelines where one task's HDFS output becomes the next task's input.
  • You need directory-level completion semantics using flag files like _SUCCESS.
  • Creating temporary or intermediate HDFS locations that should be cleaned up automatically.
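The chaining described above -- one task's output becoming the next task's input, with existence checks making re-runs safe -- can be sketched with a minimal runner. This is a hypothetical helper, not Luigi's scheduler; the local filesystem again stands in for HDFS.

```python
import os


def run_if_missing(output_path, produce):
    """Idempotent task runner: skip the task when its output already
    exists, mirroring the orchestrator's existence check against HDFS.

    `produce` is a callable that writes the output to `output_path`.
    """
    if os.path.exists(output_path):
        return "skipped"
    produce(output_path)
    return "ran"
```

Restarting a multi-step pipeline simply re-invokes `run_if_missing` for every step: completed steps become no-ops, and work resumes at the first missing output.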

Theoretical Basis

HDFS Data Sources rests on several distributed systems principles:

  1. Immutable output contracts -- In the MapReduce paradigm, job outputs are write-once. Once a directory and its success marker exist, the data is considered final. This immutability simplifies consistency reasoning across pipeline stages.
  2. Idempotency through existence checks -- The orchestrator queries HDFS metadata (NameNode) to determine whether an output path exists. If it does, the task is a no-op. This property is essential for fault-tolerant re-execution: a pipeline can be restarted from any point without duplicating work.
  3. Flag-file completion protocol -- Hadoop MapReduce writes a _SUCCESS file only after all reducers finish and all part files are committed. Checking for this flag, rather than merely checking for the directory, avoids reading incomplete data.
  4. Atomic rename on HDFS -- Rename is a metadata-only operation on the NameNode, atomic and effectively O(1) within a single filesystem namespace. Writing to a temporary path and renaming it to the final path therefore yields all-or-nothing visibility: consumers either see the complete output or nothing at all.
  5. Format piping -- Data formats can be composed through a pipe operator, chaining an in-memory format (e.g., UTF-8 text) with an HDFS transport format (e.g., Plain HDFS writer). This composability follows the Decorator pattern and keeps format logic orthogonal to path logic.
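Principles 2-4 can be sketched together: write part files to a temporary directory, atomically rename it into place, and drop a flag file that downstream checks consult. The helper names below are illustrative (on real Hadoop, the `_SUCCESS` marker is written by the job's output committer, not by user code), and the local filesystem stands in for HDFS.

```python
import os

SUCCESS_FLAG = "_SUCCESS"


def write_job_output(final_dir, parts):
    """Write part files to a temp directory, then atomically move the
    whole directory into place and drop a _SUCCESS flag, mimicking the
    MapReduce commit protocol. `parts` maps part-file name to content.
    """
    tmp_dir = final_dir + "-temp"
    os.makedirs(tmp_dir)
    for name, data in parts.items():
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write(data)
    # On HDFS this rename is a metadata-only NameNode operation, so
    # consumers never observe a half-written output directory.
    os.rename(tmp_dir, final_dir)
    open(os.path.join(final_dir, SUCCESS_FLAG), "w").close()


def is_complete(output_dir):
    # Check the flag file, not merely the directory, to avoid
    # reading incomplete part files.
    return os.path.exists(os.path.join(output_dir, SUCCESS_FLAG))
```

Note that `is_complete` also doubles as the idempotency check from principle 2: a re-run that finds the flag present treats the whole step as done.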

Related Pages

Implementation:Spotify_Luigi_HdfsTarget
