Principle:Spotify Luigi Spark Pipeline Execution


Overview

Spark Pipeline Execution is the orchestration pattern in which a pipeline scheduler assembles a complete spark-submit invocation, launches it as a subprocess, monitors its lifecycle, and integrates the result back into the broader dependency graph.

Description

Defining a Spark job and configuring its resources are necessary but not sufficient -- the job must ultimately be executed. Spark Pipeline Execution covers the full lifecycle from the moment the scheduler decides that a task is ready to run, through the subprocess launch, to the final success-or-failure determination. The key phases are:

1. Command Assembly

The orchestrator combines two independently constructed halves of the command line:

Spark command -- The spark-submit binary, cluster manager flags, resource flags, configuration properties, and supplementary file arguments.
App command -- The application artifact path followed by positional arguments specific to the business logic (input paths, output paths, parameters).

The full argument list is the concatenation of these two halves.
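The concatenation can be sketched in a few lines of Python. This is a minimal illustration rather than the scheduler's actual code: the helper names spark_command and app_command, the application file wordcount.py, and all paths and values are assumptions chosen for the example. The flags themselves (--master, --deploy-mode, --executor-memory, --conf, --py-files) are standard spark-submit options.

```python
def spark_command(master, deploy_mode=None, executor_memory=None,
                  conf=None, py_files=None):
    """Build the Spark half: binary, cluster flags, resources, properties, files."""
    command = ["spark-submit", "--master", master]
    if deploy_mode:
        command += ["--deploy-mode", deploy_mode]
    if executor_memory:
        command += ["--executor-memory", executor_memory]
    for key, value in (conf or {}).items():
        command += ["--conf", "{}={}".format(key, value)]
    if py_files:
        command += ["--py-files", ",".join(py_files)]
    return command


def app_command(app, app_options=()):
    """Build the app half: artifact path followed by positional arguments."""
    return [app] + [str(option) for option in app_options]


# The full argument vector is simply the two halves joined together.
# All values below are hypothetical.
args = (
    spark_command(
        "yarn",
        deploy_mode="cluster",
        executor_memory="4g",
        conf={"spark.sql.shuffle.partitions": 200},
    )
    + app_command("wordcount.py", ["/data/in", "/data/out", 10])
)
```

Keeping the halves separate means resource settings can change without touching the business-logic arguments, and vice versa.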

2. Environment Preparation

Before spawning the process, the orchestrator sets up environment variables that spark-submit and the underlying Hadoop stack expect, such as HADOOP_CONF_DIR and HADOOP_USER_NAME.
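One way to prepare such an environment is to copy the parent process's variables and overlay the Hadoop-specific ones. The sketch below assumes a hypothetical helper named program_environment; the directory and user name are placeholder values.

```python
import os


def program_environment(hadoop_conf_dir=None, hadoop_user_name=None, base_env=None):
    """Return a copy of the base environment with Hadoop variables overlaid."""
    # Copy so that the orchestrator's own environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if hadoop_conf_dir:
        env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    if hadoop_user_name:
        env["HADOOP_USER_NAME"] = hadoop_user_name
    return env


# Placeholder values for illustration only.
parent_env = {"PATH": "/usr/bin"}
child_env = program_environment("/etc/hadoop/conf", "etl", base_env=parent_env)
```

The resulting dictionary is what would be passed as the env argument when the subprocess is spawned.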

3. Subprocess Launch and Monitoring

The assembled command is executed via the operating system's process-spawning mechanism. The orchestrator captures both standard output and standard error. For Spark applications, stderr is particularly important because Spark writes its progress logs (stage completion, shuffle statistics, and the Spark UI URL) to stderr.
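A minimal sketch of this phase using Python's subprocess module follows. It is an assumption about how such a launcher could look, not the scheduler's implementation: stdout is redirected to a temporary file while stderr is read line by line so that it can be inspected while the job runs. To keep the example runnable without a Spark installation, a small Python child process stands in for spark-submit; the function name launch_and_monitor and the child's output are hypothetical.

```python
import subprocess
import sys
import tempfile


def launch_and_monitor(command, env=None, on_stderr_line=None):
    """Run command, stream stderr line by line, return (exit code, stdout, stderr)."""
    stderr_lines = []
    # stdout goes to a temp file so that a chatty child cannot fill a pipe
    # and block while the parent is busy reading stderr.
    with tempfile.TemporaryFile(mode="w+") as stdout_file:
        with subprocess.Popen(
            command,
            env=env,
            stdout=stdout_file,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        ) as proc:
            for line in proc.stderr:
                stderr_lines.append(line)
                if on_stderr_line is not None:
                    on_stderr_line(line)
            returncode = proc.wait()
        stdout_file.seek(0)
        stdout = stdout_file.read()
    return returncode, stdout, "".join(stderr_lines)


# Stand-in for spark-submit: writes a result to stdout and a log line to stderr.
child = [
    sys.executable,
    "-c",
    "import sys; print('result rows: 3'); print('INFO stage 0 finished', file=sys.stderr)",
]
seen = []
code, out, err = launch_and_monitor(child, on_stderr_line=seen.append)
```

The on_stderr_line callback is the hook where log forwarding or pattern matching on Spark's progress output could be attached.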

4. Tracking URL Extraction

In production environments, operators need to inspect the Spark UI while a job is running. The orchestrator scans the stderr stream for a line matching a tracking-URL regular expression and, when one is found, registers the captured URL as the task's tracking URL so that it is visible in the scheduler's web interface.
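The scan can be sketched as a regular-expression search over stderr lines. Everything specific here is an assumption made for illustration: the pattern, the sample log lines, the host name, and the helper find_tracking_url are hypothetical, and the real pattern depends on the deploy mode and the scheduler in use. The sketch also stops at the first match, which is a simplification.

```python
import re

# Hypothetical pattern; the first capture group is the URL to register.
TRACKING_URL_PATTERN = re.compile(r"tracking URL: (https?://\S+)")


def find_tracking_url(stderr_lines, pattern, register):
    """Call register(url) for the first line that matches and return the URL."""
    for line in stderr_lines:
        match = pattern.search(line)
        if match:
            url = match.group(1)
            register(url)
            return url
    return None


# Made-up stderr lines standing in for a running job's log output.
sample_stderr = [
    "INFO Client: Submitting application to ResourceManager\n",
    "\t tracking URL: http://rm.example.com:8088/proxy/application_1/\n",
    "INFO Client: Application report (state: RUNNING)\n",
]
registered = []
url = find_tracking_url(sample_stderr, TRACKING_URL_PATTERN, registered.append)
```

In a scheduler, the register callback would be whatever function publishes the URL to the web interface; a plain list append is used here so the example runs on its own.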

5. Completion and Error Handling

When the subprocess terminates, the orchestrator checks its exit code. A zero exit code means the Spark application completed successfully; on any non-zero code the orchestrator raises an error that includes the captured stdout and stderr for debugging, and the task fails. On success, the scheduler marks the task complete and makes its output targets available to downstream tasks.
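The exit-code check might look like the following sketch. The exception class SparkJobFailed and the helper run_checked are hypothetical names, and a short Python child that exits with code 3 stands in for a failing spark-submit so that the example runs anywhere.

```python
import subprocess
import sys


class SparkJobFailed(RuntimeError):
    """Hypothetical error type carrying the captured output of a failed run."""

    def __init__(self, command, returncode, stdout, stderr):
        self.command = command
        self.returncode = returncode
        self.stdout = stdout
        self.stderr = stderr
        super().__init__(
            "command {} exited with code {}\n--- stdout ---\n{}\n--- stderr ---\n{}".format(
                command, returncode, stdout, stderr
            )
        )


def run_checked(command, env=None):
    """Run command; return (stdout, stderr) on exit code 0, raise otherwise."""
    completed = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if completed.returncode != 0:
        raise SparkJobFailed(
            command, completed.returncode, completed.stdout, completed.stderr
        )
    return completed.stdout, completed.stderr


# Stand-in for a failing spark-submit invocation.
failing = [
    sys.executable,
    "-c",
    "import sys; print('partial output'); print('boom', file=sys.stderr); sys.exit(3)",
]
try:
    run_checked(failing)
    task_complete = True
except SparkJobFailed as error:
    task_complete = False
    failure = error
```

Because the exception propagates out of the task's run step, the scheduler never sees the task as complete and downstream tasks stay blocked.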

6. Serialisation Bridge (Inline PySpark)

For inline PySpark jobs, an additional pre-execution step occurs: the orchestrator serialises (pickles) the task instance to disk, and the spark-submit command is directed at a generic runner script that deserialises the task on the Spark driver, creates a SparkContext or SparkSession, and calls the task's main() method. After execution, temporary files are cleaned up.

Usage

Spark Pipeline Execution applies whenever:

A pipeline step involves launching a Spark application as a subprocess.
The orchestrator must capture Spark's stderr for logging and tracking URL extraction.
The orchestrator must propagate Spark job failures as task failures, blocking downstream work.
Inline PySpark logic must be serialised, shipped to a Spark driver, deserialised, and invoked.

Theoretical Basis

Spark Pipeline Execution combines two well-established patterns:

External Program Delegation

The orchestrator does not embed a Spark runtime. Instead, it delegates to the spark-submit launcher, which handles class loading, cluster negotiation, and job scheduling. The orchestrator's responsibility is limited to:

Constructing the correct argument vector.
Spawning the process.
Monitoring its exit status.

This is an application of the Adapter pattern: the orchestrator adapts its internal task model to the external spark-submit interface.

Pickle-based Remote Invocation (for inline PySpark)

When the Spark logic is defined as a method on the task object rather than in a separate file, the task must be transported to the Spark driver. The algorithm is:

Serialise -- Use Python's pickle module to serialise the task instance (with all its parameter values) to a byte stream.
Ship -- Submit a generic runner script (pyspark_runner.py) as the Spark application and pass the pickle file's path as its first argument, so that the pickled task reaches the driver.
Deserialise -- On the driver, the runner reads the pickle file, reconstructs the task instance, and calls setup(conf) followed by main(sc, *args).
Clean up -- After execution, the temporary directory containing the pickle and module copy is deleted.
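The four steps can be walked through in a single process as a rough sketch. The task class WordCountTask, the file name task.pickle, and the helper functions are hypothetical, and plain Python objects stand in for the Spark configuration and context so that the example runs without Spark; in a real run the deserialise step would happen inside the runner script on the driver.

```python
import os
import pickle
import shutil
import tempfile


class WordCountTask:
    """Hypothetical inline task exposing the setup(conf) / main(sc, *args) pair."""

    def __init__(self, min_count):
        self.min_count = min_count

    def setup(self, conf):
        conf["example.min.count"] = str(self.min_count)

    def main(self, sc, *args):
        return (sc, args, self.min_count)


def serialise(task, run_dir):
    """Serialise: write the task instance, parameters included, to disk."""
    path = os.path.join(run_dir, "task.pickle")
    with open(path, "wb") as handle:
        pickle.dump(task, handle)
    return path


def run_from_pickle(pickle_path, conf, sc, *args):
    """Deserialise: what a generic runner would do with its first argument."""
    with open(pickle_path, "rb") as handle:
        task = pickle.load(handle)
    task.setup(conf)
    return task.main(sc, *args)


run_dir = tempfile.mkdtemp(prefix="inline_pyspark_")
try:
    pickle_path = serialise(WordCountTask(min_count=2), run_dir)
    conf = {}  # stand-in for a Spark configuration object
    result = run_from_pickle(pickle_path, conf, "fake-sc", "in.txt", "out.txt")
finally:
    # Clean up: remove the temporary directory whether or not the run succeeded.
    shutil.rmtree(run_dir)
```

Unpickling only works if the task's class can be imported under the same module name on the receiving side, which is why the module copy travels alongside the pickle.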

A special case handles tasks defined in __main__: the pickle byte stream is post-processed to replace the __main__ module reference with the module name derived from the script's filename, so that the task's class can be resolved when it is deserialised on the remote driver.
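The byte-level rewrite can be illustrated as follows. This is a sketch under stated assumptions, not the scheduler's code: it pins pickle protocol 2, where a class reference is stored as the GLOBAL opcode (the byte c followed by the module name, a newline, and the class name), because newer protocols encode class references differently. The class InlineTask, the helper retarget_module, and the module name my_job are hypothetical, and a module object registered in sys.modules stands in for the script being importable on the driver.

```python
import pickle
import sys
import types


class InlineTask:
    """Hypothetical task defined in the launching script."""

    def __init__(self, threshold):
        self.threshold = threshold


def retarget_module(payload, old_module, new_module):
    """Rewrite GLOBAL opcodes that point at old_module to point at new_module."""
    old = b"c" + old_module.encode("ascii") + b"\n"
    new = b"c" + new_module.encode("ascii") + b"\n"
    # A plain byte replace is fragile: it would also rewrite any data bytes
    # that happen to look like the opcode. Acceptable for a sketch only.
    return payload.replace(old, new)


# When this file is run as a script, InlineTask.__module__ is "__main__".
old_module = InlineTask.__module__
payload = pickle.dumps(InlineTask(threshold=5), protocol=2)
patched = retarget_module(payload, old_module, "my_job")

# Stand-in for the driver, where the script is importable as module my_job.
driver_module = types.ModuleType("my_job")
driver_module.InlineTask = InlineTask
sys.modules["my_job"] = driver_module

restored = pickle.loads(patched)
```

Without the rewrite, the driver would look for the class in its own __main__, which is the runner script rather than the script that defined the task.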
