Environment:Spotify Luigi Apache Spark

Knowledge Sources	Spotify Luigi Apache Spark
Domains	Infrastructure, Big_Data, Distributed_Computing
Last Updated	2026-02-10 07:00 GMT

Overview

Apache Spark environment with `spark-submit` binary and PySpark support for distributed data processing via Luigi.

Description

This environment provides the Apache Spark dependencies required to run Luigi's Spark contrib module. It requires a configured `spark-submit` binary, either locally installed or accessible via a cluster manager (YARN, Mesos, Kubernetes). The environment supports both Java/Scala Spark jobs (`SparkSubmitTask`) and PySpark jobs (`PySparkTask`). PySpark tasks are serialized via pickle and submitted to the Spark cluster for remote execution.

Usage

Use this environment for any pipeline that runs distributed data processing on Apache Spark. It is required for the Spark_Processing_Pipeline workflow and any task using `SparkSubmitTask` or `PySparkTask`.

System Requirements

Category	Requirement	Notes
OS	Linux, macOS	Spark runs on the JVM; Windows is possible but not recommended
Java	JDK 8 or 11	Required by Spark runtime
Spark	Apache Spark installation	spark-submit must be on PATH or configured
Network	Access to cluster manager	YARN, Mesos, Kubernetes, or Standalone; not needed in local mode

Dependencies

System Packages

`spark-submit` binary (on PATH or configured via `[spark] spark-submit`)
Apache Spark installation
Java JDK 8 or 11

Python Packages

`luigi` (core)
`pyspark` (for PySparkTask only)

Credentials

The following configuration should be set in `luigi.cfg`:

`[spark] spark-submit`: Path to spark-submit binary (default: `spark-submit`)
`[spark] master`: Spark master URL (e.g., `yarn`, `local[*]`, `spark://host:7077`)
`[spark] deploy-mode`: Deployment mode (`client` or `cluster`)
`[spark] hadoop-conf-dir`: Hadoop configuration directory for YARN mode
`[spark] py-packages`: Python packages to distribute to Spark nodes
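
As a rough illustration of the keys above, the following sketch parses a sample `luigi.cfg` fragment with Python's standard `configparser` rather than Luigi's own configuration class. The paths and values are placeholders, and the helper name `read_spark_settings` is hypothetical.

```python
import configparser

# Hypothetical luigi.cfg contents; every value here is a placeholder.
SAMPLE_CFG = """
[spark]
spark-submit = /opt/spark/bin/spark-submit
master = yarn
deploy-mode = cluster
hadoop-conf-dir = /etc/hadoop/conf
"""


def read_spark_settings(text, section="spark"):
    """Return Spark settings from config text, applying the documented default."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        # Falls back to the bare command name when the key is absent.
        "spark-submit": parser.get(section, "spark-submit", fallback="spark-submit"),
        "master": parser.get(section, "master", fallback=None),
        "deploy-mode": parser.get(section, "deploy-mode", fallback=None),
        "hadoop-conf-dir": parser.get(section, "hadoop-conf-dir", fallback=None),
    }


settings = read_spark_settings(SAMPLE_CFG)
print(settings["master"], settings["deploy-mode"])
```

With an empty configuration the same function returns `spark-submit` for the binary, which is why the command must then be resolvable on PATH.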

Environment variables:

`HADOOP_CONF_DIR`: Hadoop configuration directory (used in YARN mode)
`HADOOP_USER_NAME`: User identity for Hadoop access
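
Spark configuration properties (set in the job's Spark configuration, not as environment variables):

`spark.pyspark.python`: Python binary for PySpark on the driver and the executors
`spark.pyspark.driver.python`: Python binary for the driver only (overrides `spark.pyspark.python` there)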

Quick Install

# Install Luigi (Spark support is built-in, no extra pip dependency)
pip install luigi

# PySpark (if running PySparkTask)
pip install pyspark

Code Evidence

Spark-submit resolution from `luigi/contrib/spark.py:90-92`:

@property
def spark_submit(self):
    return configuration.get_config().get(self.spark_version, 'spark-submit', 'spark-submit')

Environment variables setup from `luigi/contrib/spark.py:190-196`:

def get_environment(self):
    env = os.environ.copy()
    for prop in ('HADOOP_CONF_DIR', 'HADOOP_USER_NAME'):
        var = getattr(self, prop.lower(), None)
        if var:
            env[prop] = var
    return env
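
The same pattern can be tried outside Luigi. This standalone sketch copies an environment and overlays only the values that are set. The function name `build_spark_env` and its keyword arguments are hypothetical and not part of Luigi.

```python
import os


def build_spark_env(hadoop_conf_dir=None, hadoop_user_name=None, base=None):
    """Copy an environment and overlay the Hadoop variables that were provided."""
    env = dict(os.environ if base is None else base)
    overrides = {
        "HADOOP_CONF_DIR": hadoop_conf_dir,
        "HADOOP_USER_NAME": hadoop_user_name,
    }
    for name, value in overrides.items():
        if value:  # unset or empty values leave the inherited variable untouched
            env[name] = value
    return env


env = build_spark_env(hadoop_conf_dir="/etc/hadoop/conf", base={"PATH": "/usr/bin"})
print(sorted(env))
```

Because the environment is copied, the parent process keeps its own variables unchanged; only the child `spark-submit` process would see the overrides.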

PySpark configuration from `luigi/contrib/spark.py:122-127`:

if self.pyspark_python:
    conf['spark.pyspark.python'] = self.pyspark_python
if self.pyspark_driver_python:
    conf['spark.pyspark.driver.python'] = self.pyspark_driver_python
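
To show where these properties end up, the sketch below turns a conf dictionary into `spark-submit` arguments. `--master`, `--deploy-mode`, and `--conf` are standard `spark-submit` options. The helper `build_command` is hypothetical and only approximates what Luigi assembles internally.

```python
def build_command(app, master=None, deploy_mode=None, conf=None,
                  spark_submit="spark-submit"):
    """Assemble a spark-submit argument list (illustrative, not Luigi's code)."""
    command = [spark_submit]
    if master:
        command += ["--master", master]
    if deploy_mode:
        command += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property becomes its own --conf key=value pair.
        command += ["--conf", f"{key}={value}"]
    command.append(app)
    return command


cmd = build_command(
    "job.py",
    master="local[*]",
    conf={"spark.pyspark.python": "/usr/bin/python3"},
)
print(" ".join(cmd))
```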

Pickle protocol configuration from `luigi/contrib/spark.py:297`:

return configuration.get_config().getint('spark', 'pickle-protocol', pickle.DEFAULT_PROTOCOL)
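
The protocol matters when the interpreter that pickles the task and the one that unpickles it run different Python versions, since an older interpreter cannot read a newer protocol. A minimal round-trip with an explicit protocol, using a plain dictionary as a stand-in for a real task object (the helper `roundtrip` is hypothetical):

```python
import pickle


def roundtrip(obj, protocol=pickle.DEFAULT_PROTOCOL):
    """Serialize with the given protocol and load the result back."""
    payload = pickle.dumps(obj, protocol=protocol)
    return pickle.loads(payload)


# Stand-in for task state; a real PySparkTask carries more than this.
spec = {"name": "wordcount", "partitions": 8}

# Protocol 2 is readable by every Python 3 release.
restored = roundtrip(spec, protocol=2)
print(restored == spec)
```

Lowering `[spark] pickle-protocol` trades some efficiency for compatibility with older interpreters on the cluster.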

Common Errors

Error Message	Cause	Solution
`FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'`	Spark not installed or `spark-submit` not on PATH	Install Spark or set `[spark] spark-submit` in luigi.cfg
`Py4JJavaError`	Java exception during Spark execution	Check Spark logs for root cause
`pickle.UnpicklingError`	Task class not available on Spark nodes	Ensure task module is distributed via `--py-files`
`HADOOP_CONF_DIR not set`	Missing Hadoop config for YARN mode	Set HADOOP_CONF_DIR environment variable
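
Two of these failures can be caught before a pipeline starts. The following preflight sketch uses only the standard library. The function name `preflight` and the choice of checks are assumptions, not something Luigi provides, and the YARN check ignores a `hadoop-conf-dir` value set in `luigi.cfg`.

```python
import os
import shutil


def preflight(spark_submit="spark-submit", master="local[*]", environ=None):
    """Return a list of problems found; an empty list means none were detected."""
    environ = os.environ if environ is None else environ
    problems = []
    # shutil.which resolves a command name against PATH or checks an explicit path.
    if shutil.which(spark_submit) is None:
        problems.append(
            f"{spark_submit!r} not found; install Spark or set [spark] spark-submit"
        )
    if master.startswith("yarn") and not environ.get("HADOOP_CONF_DIR"):
        problems.append("HADOOP_CONF_DIR is not set but master is YARN")
    return problems


for problem in preflight(master="yarn"):
    print(problem)
```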

Compatibility Notes

Local mode: Set `master=local[*]` for development/testing without a cluster.
YARN mode: Requires `HADOOP_CONF_DIR` to be set and accessible.
PySpark serialization: `PySparkTask` instances are pickled and sent to worker nodes. All imported modules must be available on the remote nodes.
Spark version sections: Settings are read from the `luigi.cfg` section named by the task's `spark_version` property, which is `[spark]` unless overridden. Overriding it lets different tasks use separate, version-specific sections.
