Environment:Spotify Luigi Apache Spark
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Big_Data, Distributed_Computing |
| Last Updated | 2026-02-10 07:00 GMT |
Overview
Apache Spark environment with `spark-submit` binary and PySpark support for distributed data processing via Luigi.
Description
This environment provides the Apache Spark dependencies required to run Luigi's Spark contrib module. It requires a `spark-submit` binary that Luigi can invoke, either found on PATH or configured explicitly. That binary runs jobs in local mode or submits them to a cluster manager (YARN, Mesos, Kubernetes, or Spark Standalone). The environment supports both Java/Scala Spark jobs (`SparkSubmitTask`) and PySpark jobs (`PySparkTask`). PySpark tasks are serialized via pickle and submitted to the Spark cluster for remote execution.
Usage
Use this environment for any pipeline that runs distributed data processing on Apache Spark. It is required for the Spark_Processing_Pipeline workflow and any task using `SparkSubmitTask` or `PySparkTask`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux, macOS | Spark runs on JVM; Windows possible but not recommended |
| Java | JDK 8 or 11 | Required by Spark runtime |
| Spark | Apache Spark installation | spark-submit must be on PATH or configured |
| Network | Access to cluster manager | YARN, Mesos, Kubernetes, or Standalone; not needed in local mode |
Dependencies
System Packages
- `spark-submit` binary (on PATH or configured via `[spark] spark-submit`)
- Apache Spark installation
- Java JDK 8 or 11
Python Packages
- `luigi` (core)
- `pyspark` (for PySparkTask only)
Credentials
The following configuration should be set in `luigi.cfg`:
- `[spark] spark-submit`: Path to spark-submit binary (default: `spark-submit`)
- `[spark] master`: Spark master URL (e.g., `yarn`, `local[*]`, `spark://host:7077`)
- `[spark] deploy-mode`: Deployment mode (`client` or `cluster`)
- `[spark] hadoop-conf-dir`: Hadoop configuration directory for YARN mode
- `[spark] py-packages`: Python packages to distribute to Spark nodes
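As a rough illustration, the settings above might be laid out as follows. This sketch parses a sample file with Python's standard `configparser` rather than Luigi's own configuration loader, and every value (paths, master URL, deploy mode) is a placeholder assumption, not a recommended setting.

```python
import configparser

# Hypothetical luigi.cfg contents; all values are placeholders.
SAMPLE_CFG = """
[spark]
spark-submit = /opt/spark/bin/spark-submit
master = yarn
deploy-mode = cluster
hadoop-conf-dir = /etc/hadoop/conf
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)
spark = parser["spark"]

settings = {
    # Falls back to the bare command name, mirroring the documented default.
    "spark-submit": spark.get("spark-submit", "spark-submit"),
    "master": spark.get("master"),
    "deploy-mode": spark.get("deploy-mode"),
    "hadoop-conf-dir": spark.get("hadoop-conf-dir"),
    # Not set in the sample, so this resolves to None.
    "py-packages": spark.get("py-packages"),
}

for key, value in settings.items():
    print(f"{key}: {value}")
```

Options left out of the file simply resolve to their fallback, so a local development config can be as small as a `master` entry.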
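Environment variables:
- `HADOOP_CONF_DIR`: Hadoop configuration directory (used in YARN mode)
- `HADOOP_USER_NAME`: User identity for Hadoop access
Spark configuration properties (set from the task's `pyspark_python` and `pyspark_driver_python` properties; these are not environment variables):
- `spark.pyspark.python`: Python binary for PySpark on the executors, and on the driver unless overridden
- `spark.pyspark.driver.python`: Python binary on the driver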
Quick Install
# Install Luigi (Spark support is built-in, no extra pip dependency)
pip install luigi
# PySpark (if running PySparkTask)
pip install pyspark
Code Evidence
Spark-submit resolution from `luigi/contrib/spark.py:90-92`:
@property
def spark_submit(self):
    return configuration.get_config().get(self.spark_version, 'spark-submit', 'spark-submit')
Environment variables setup from `luigi/contrib/spark.py:190-196`:
def get_environment(self):
    env = os.environ.copy()
    for prop in ('HADOOP_CONF_DIR', 'HADOOP_USER_NAME'):
        var = getattr(self, prop.lower(), None)
        if var:
            env[prop] = var
    return env
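The overlay logic in this excerpt can be tried outside Luigi. The following is a minimal standalone sketch, assuming a hypothetical `DemoTask` class and `build_environment` helper that are not part of Luigi; it copies a base environment and exports only the Hadoop variables that have a truthy value on the task.

```python
import os


class DemoTask:
    """Hypothetical stand-in for a Spark task; not a Luigi class."""
    hadoop_conf_dir = "/etc/hadoop/conf"  # placeholder path
    hadoop_user_name = None               # unset, so it is not exported


def build_environment(task, base_env=None):
    """Copy the base environment and overlay Hadoop variables set on the task."""
    env = dict(os.environ if base_env is None else base_env)
    for prop in ("HADOOP_CONF_DIR", "HADOOP_USER_NAME"):
        # The attribute name is the lower-cased variable name.
        value = getattr(task, prop.lower(), None)
        if value:
            env[prop] = value
    return env


env = build_environment(DemoTask(), base_env={"PATH": "/usr/bin"})
print(env)
```

Because the copy is taken first, a value set on the task overrides one inherited from the parent process, while unset properties leave the inherited value untouched.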
PySpark configuration from `luigi/contrib/spark.py:122-127`:
if self.pyspark_python:
    conf['spark.pyspark.python'] = self.pyspark_python
if self.pyspark_driver_python:
    conf['spark.pyspark.driver.python'] = self.pyspark_driver_python
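`spark-submit` accepts Spark properties as repeated `--conf key=value` arguments. The excerpt does not show how Luigi turns its `conf` dictionary into a command line, so the sketch below is an assumption about the general shape only; `build_spark_submit_args` is a hypothetical helper and the interpreter path is a placeholder.

```python
def build_spark_submit_args(spark_submit="spark-submit", master=None,
                            deploy_mode=None, conf=None):
    """Assemble a spark-submit argument list (illustrative, not Luigi's code)."""
    args = [spark_submit]
    if master:
        args += ["--master", master]
    if deploy_mode:
        args += ["--deploy-mode", deploy_mode]
    # Each Spark property becomes its own `--conf key=value` pair.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = {}
pyspark_python = "/usr/bin/python3"   # placeholder interpreter path
pyspark_driver_python = None          # unset: the property is omitted
if pyspark_python:
    conf["spark.pyspark.python"] = pyspark_python
if pyspark_driver_python:
    conf["spark.pyspark.driver.python"] = pyspark_driver_python

args = build_spark_submit_args(master="local[*]", conf=conf)
print(args)
```

Passing the arguments as a list rather than a single string avoids shell quoting issues with values such as `local[*]`.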
Pickle protocol configuration from `luigi/contrib/spark.py:297`:
return configuration.get_config().getint('spark', 'pickle-protocol', pickle.DEFAULT_PROTOCOL)
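The effect of the `pickle-protocol` option can be shown with the standard library alone. This is a minimal sketch, assuming a hypothetical config string and a plain dictionary standing in for a task instance. A lower protocol may help when the interpreter that unpickles the task is older than the one that submitted it, since older Python versions cannot read newer protocols.

```python
import configparser
import pickle

# Hypothetical config; omit the option to fall back to pickle.DEFAULT_PROTOCOL.
parser = configparser.ConfigParser()
parser.read_string("[spark]\npickle-protocol = 2\n")

protocol = parser.getint("spark", "pickle-protocol",
                         fallback=pickle.DEFAULT_PROTOCOL)

payload = {"task": "demo", "partitions": 4}  # stand-in for a task instance
blob = pickle.dumps(payload, protocol=protocol)
restored = pickle.loads(blob)

print(protocol, restored == payload)
```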
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'` | Spark not installed or `spark-submit` not on PATH | Install Spark and add its `bin` directory to PATH, or set `[spark] spark-submit` in luigi.cfg |
| `Py4JJavaError` | Java exception during Spark execution | Check Spark logs for root cause |
| `pickle.UnpicklingError` | Task class not available on Spark nodes | Ensure task module is distributed via `--py-files` |
| `HADOOP_CONF_DIR not set` | Missing Hadoop config for YARN mode | Set the `HADOOP_CONF_DIR` environment variable or `[spark] hadoop-conf-dir` in luigi.cfg |
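Two of the errors above can be caught before a pipeline starts. The following is a minimal sketch of such a check using only the standard library; `preflight` is a hypothetical helper, not part of Luigi or Spark, and it covers only the missing binary and the missing `HADOOP_CONF_DIR` cases.

```python
import os
import shutil


def preflight(spark_submit="spark-submit", master=None, env=None):
    """Return a list of human-readable problems; empty means none were found."""
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the command cannot be located.
    if shutil.which(spark_submit, path=env.get("PATH")) is None:
        problems.append(f"{spark_submit!r} not found on PATH")
    if master == "yarn" and not env.get("HADOOP_CONF_DIR"):
        problems.append("HADOOP_CONF_DIR is not set for YARN mode")
    return problems


# Hypothetical binary name and search path, chosen so that the lookup fails.
found = preflight("spark-submit-missing-demo", master="yarn",
                  env={"PATH": "/nonexistent-demo-dir"})
for problem in found:
    print("WARNING:", problem)
```

Returning a list instead of raising on the first failure lets a caller report every problem in one pass.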
Compatibility Notes
- Local mode: Set `master=local[*]` for development/testing without a cluster.
- YARN mode: Requires `HADOOP_CONF_DIR` to be set and accessible.
- PySpark serialization: `PySparkTask` instances are pickled and sent to worker nodes. All imported modules must be available on the remote nodes.
- Spark version sections: The `spark_version` property names the luigi.cfg section that settings are read from (`[spark]` by default). Override it in a task to keep version-specific settings in a separate, custom-named section.
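The section lookup described in the last note can be sketched with the standard `configparser`. This is an illustration under assumptions: the `[spark3]` section name, the paths, and the `resolve_spark_submit` helper are hypothetical, and the sketch does not inherit values from `[spark]` when a custom section lacks an option.

```python
import configparser

# Hypothetical config with the default section and a custom-named one.
parser = configparser.ConfigParser()
parser.read_string("""
[spark]
spark-submit = /opt/spark/bin/spark-submit

[spark3]
master = local[*]
""")


def resolve_spark_submit(config, spark_version="spark"):
    """Look up spark-submit in the section named by spark_version."""
    return config.get(spark_version, "spark-submit", fallback="spark-submit")


print(resolve_spark_submit(parser))            # value from [spark]
print(resolve_spark_submit(parser, "spark3"))  # option missing: falls back
```

A custom section therefore needs its own `spark-submit` entry if the bare command name on PATH is not the intended binary.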