Environment:Spotify Luigi Apache Spark

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Big_Data, Distributed_Computing
Last Updated: 2026-02-10 07:00 GMT

Overview

Apache Spark environment with `spark-submit` binary and PySpark support for distributed data processing via Luigi.

Description

This environment provides the Apache Spark dependencies required to run Luigi's Spark contrib module. It requires a configured `spark-submit` binary, either locally installed or accessible via a cluster manager (YARN, Mesos, Kubernetes). The environment supports both Java/Scala Spark jobs (`SparkSubmitTask`) and PySpark jobs (`PySparkTask`). PySpark tasks are serialized via pickle and submitted to the Spark cluster for remote execution.

Usage

Use this environment for any pipeline that runs distributed data processing on Apache Spark. It is required for the Spark_Processing_Pipeline workflow and any task using `SparkSubmitTask` or `PySparkTask`.

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux, macOS | Spark runs on JVM; Windows possible but not recommended |
| Java | JDK 8 or 11 | Required by Spark runtime |
| Spark | Apache Spark installation | `spark-submit` must be on PATH or configured |
| Network | Access to cluster manager | YARN, Mesos, Standalone, or local mode |

Dependencies

System Packages

  • `spark-submit` binary (on PATH or configured via `[spark] spark-submit`)
  • Apache Spark installation
  • Java JDK 8 or 11

Python Packages

  • `luigi` (core)
  • `pyspark` (for PySparkTask only)

Credentials

The following configuration should be set in `luigi.cfg`:

  • `[spark] spark-submit`: Path to spark-submit binary (default: `spark-submit`)
  • `[spark] master`: Spark master URL (e.g., `yarn`, `local[*]`, `spark://host:7077`)
  • `[spark] deploy-mode`: Deployment mode (`client` or `cluster`)
  • `[spark] hadoop-conf-dir`: Hadoop configuration directory for YARN mode
  • `[spark] py-packages`: Python packages to distribute to Spark nodes
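
As a rough illustration of how these keys fit together, the sketch below parses a sample `luigi.cfg` with Python's standard `configparser` and applies the same fall-back-to-default lookup that the `spark-submit` option uses. The paths, the `SAMPLE_CFG` text, and the `read_spark_option` helper are hypothetical; Luigi reads the file through its own configuration layer, not through this code.

```python
import configparser

# Hypothetical luigi.cfg contents; every path and value here is a placeholder.
SAMPLE_CFG = """
[spark]
spark-submit = /opt/spark/bin/spark-submit
master = yarn
deploy-mode = cluster
hadoop-conf-dir = /etc/hadoop/conf
"""


def read_spark_option(cfg, option, default=None, section="spark"):
    """Return an option from the given section, or the default if it is unset."""
    return cfg.get(section, option, fallback=default)


cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE_CFG)

spark_submit = read_spark_option(cfg, "spark-submit", "spark-submit")
master = read_spark_option(cfg, "master")
py_packages = read_spark_option(cfg, "py-packages")  # not set in the sample

print(spark_submit, master, py_packages)
```

The `section` parameter is there only to show that the same lookup could target a section other than `[spark]`.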

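Environment variables:

  • `HADOOP_CONF_DIR`: Hadoop configuration directory (used in YARN mode)
  • `HADOOP_USER_NAME`: User identity for Hadoop access

Spark configuration properties (these are not environment variables; Luigi adds them to the job's Spark conf when the task's `pyspark_python` or `pyspark_driver_python` value is set):

  • `spark.pyspark.python`: Python binary for PySpark on both the driver and the executors
  • `spark.pyspark.driver.python`: Python binary for the driver only (defaults to `spark.pyspark.python`)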

Quick Install

# Install Luigi (Spark support is built-in, no extra pip dependency)
pip install luigi

# PySpark (if running PySparkTask)
pip install pyspark

Code Evidence

Spark-submit resolution from `luigi/contrib/spark.py:90-92`:

@property
def spark_submit(self):
    return configuration.get_config().get(self.spark_version, 'spark-submit', 'spark-submit')

Environment variables setup from `luigi/contrib/spark.py:190-196`:

def get_environment(self):
    env = os.environ.copy()
    for prop in ('HADOOP_CONF_DIR', 'HADOOP_USER_NAME'):
        var = getattr(self, prop.lower(), None)
        if var:
            env[prop] = var
    return env
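
The overlay logic above can be tried outside Luigi with a small stand-alone function. This is a sketch rather than Luigi's code: `build_spark_env` and its parameters are hypothetical names, and the paths are placeholders.

```python
import os


def build_spark_env(hadoop_conf_dir=None, hadoop_user_name=None, base_env=None):
    """Copy the base environment and overlay only the Hadoop settings that are set."""
    env = dict(os.environ if base_env is None else base_env)
    overrides = {
        "HADOOP_CONF_DIR": hadoop_conf_dir,
        "HADOOP_USER_NAME": hadoop_user_name,
    }
    for name, value in overrides.items():
        if value:  # falsy values (None, "") leave the inherited variable untouched
            env[name] = value
    return env


base = {"PATH": "/usr/bin", "HADOOP_USER_NAME": "inherited-user"}
env = build_spark_env(hadoop_conf_dir="/etc/hadoop/conf", base_env=base)
print(env["HADOOP_CONF_DIR"], env["HADOOP_USER_NAME"])
```

Because the function works on a copy, the caller's environment is never modified, and a variable that is already exported survives unless a value is explicitly supplied.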

PySpark configuration from `luigi/contrib/spark.py:122-127`:

if self.pyspark_python:
    conf['spark.pyspark.python'] = self.pyspark_python
if self.pyspark_driver_python:
    conf['spark.pyspark.driver.python'] = self.pyspark_driver_python
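
To make the effect of these two settings concrete, the sketch below shows one way a conf mapping could be turned into `spark-submit` arguments. `build_submit_args` is a hypothetical helper, not Luigi's implementation; the only `spark-submit` options it relies on are `--master`, `--deploy-mode`, and `--conf key=value`.

```python
from collections import OrderedDict


def build_submit_args(app, master=None, deploy_mode=None,
                      pyspark_python=None, pyspark_driver_python=None,
                      spark_submit="spark-submit"):
    """Assemble an argument list for spark-submit (hypothetical helper)."""
    conf = OrderedDict()
    if pyspark_python:
        conf["spark.pyspark.python"] = pyspark_python
    if pyspark_driver_python:
        conf["spark.pyspark.driver.python"] = pyspark_driver_python

    args = [spark_submit]
    if master:
        args += ["--master", master]
    if deploy_mode:
        args += ["--deploy-mode", deploy_mode]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application file goes after all options
    return args


args = build_submit_args("job.py", master="local[*]",
                         pyspark_python="/usr/bin/python3")
print(args)
```

Building the command as a list, rather than a single string, avoids shell quoting problems with values such as `local[*]`.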

Pickle protocol configuration from `luigi/contrib/spark.py:297`:

return configuration.get_config().getint('spark', 'pickle-protocol', pickle.DEFAULT_PROTOCOL)
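
The protocol number matters because the interpreter that unpickles the task must support the protocol that was used to write it. The sketch below round-trips a stand-in payload with an explicit protocol; the `spec` dictionary is a hypothetical placeholder for a real task object.

```python
import pickle

# Hypothetical stand-in for the state of a task that gets pickled.
spec = {"name": "word_count", "partitions": 8}

# Protocol 2 is readable by every Python 3 release, whereas
# pickle.DEFAULT_PROTOCOL depends on the interpreter that writes the pickle.
payload = pickle.dumps(spec, protocol=2)
restored = pickle.loads(payload)

# For protocol 2 and later, the second byte of the stream is the protocol number.
print(restored == spec, payload[1])
```

Pinning a lower protocol is one possible way to stay compatible when the submitting machine runs a newer Python than the cluster; whether that is needed depends on the deployment.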

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'` | Spark not installed or `spark-submit` not on PATH | Install Spark or set `[spark] spark-submit` in luigi.cfg |
| `Py4JJavaError` | Java exception during Spark execution | Check Spark logs for root cause |
| `pickle.UnpicklingError` | Task class not available on Spark nodes | Ensure task module is distributed via `--py-files` |
| `HADOOP_CONF_DIR not set` | Missing Hadoop config for YARN mode | Set the `HADOOP_CONF_DIR` environment variable or `[spark] hadoop-conf-dir` in luigi.cfg |
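
A pipeline can catch the first error in the table before any task is scheduled by checking that the binary resolves. The sketch below uses the standard library's `shutil.which`; `resolve_spark_submit` is a hypothetical helper and is not part of Luigi.

```python
import shutil


def resolve_spark_submit(configured="spark-submit"):
    """Return the path of the spark-submit binary, or None if it cannot be found."""
    return shutil.which(configured)


path = resolve_spark_submit()
if path is None:
    print("spark-submit not found; install Spark or set [spark] spark-submit in luigi.cfg")
else:
    print(f"using {path}")
```

Passing the value configured under `[spark] spark-submit` as `configured` would check a custom location instead of searching PATH for the default name.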

Compatibility Notes

  • Local mode: Set `master=local[*]` for development/testing without a cluster.
  • YARN mode: Requires `HADOOP_CONF_DIR` to be set and accessible.
  • PySpark serialization: `PySparkTask` instances are pickled and sent to worker nodes. All imported modules must be available on the remote nodes.
  • Spark version sections: Configuration is read from the section named by the task's `spark_version` property, so settings can live in `[spark]` or in a custom, version-specific section.
