Heuristic:Spotify Luigi PySpark Task Serialization

Knowledge Sources	Spotify Luigi PySpark
Domains	Big_Data, Distributed_Computing
Last Updated	2026-02-10 07:00 GMT

Overview

Remote execution pattern where PySpark tasks are serialized via pickle and submitted to Spark cluster nodes for execution.

Description

Luigi's `PySparkTask` class uses Python's `pickle` module to serialize the entire task instance (including all parameters and configuration) into a file that is submitted to the Spark cluster via `spark-submit`. The `pyspark_runner.py` module runs on the remote Spark node, deserializes (unpickles) the task object, and executes its `main()` method. This pattern requires that all Python modules referenced by the task class are available on the remote Spark workers, either pre-installed or distributed via `--py-files`.

Usage

This pattern is automatically used when subclassing `PySparkTask`. Understand the serialization constraints when designing PySpark tasks: avoid unpicklable objects (file handles, database connections, lambda functions with closures) in task instance variables.

The Insight (Rule of Thumb)

Action: Keep `PySparkTask` subclasses serializable: use only pickle-compatible attributes. Distribute required Python modules via `py_files` property or `--py-files` argument. Ensure the pickle protocol matches between driver and worker Python versions.
Value: Enables executing complex Luigi task logic on remote Spark cluster nodes without manual code distribution.
Trade-off: Serialization adds startup overhead. Task classes and all dependencies must be available for unpickling on remote nodes. Version mismatches between driver and worker Python can cause unpickling failures.

Reasoning

In a distributed Spark cluster, the driver (where Luigi runs) and the executors (where computation happens) are different machines. The task's `main()` method needs to run on the executors, but Luigi tasks are defined on the driver. The pickle serialization bridges this gap: the entire task object is serialized on the driver, sent to executors, and reconstructed there.

The pickle protocol version is configurable to handle heterogeneous Python environments:

return configuration.get_config().getint('spark', 'pickle-protocol', pickle.DEFAULT_PROTOCOL)

Code Evidence

PySpark runner entry point from `luigi/contrib/pyspark_runner.py:1-20`:

"""
The pyspark program.

This module will be run by spark-submit for PySparkTask jobs.

The first argument is a path to the pickled instance of the PySparkTask,
other arguments are the ones returned by PySparkTask.app_options()
"""

import pickle

Task deserialization from `luigi/contrib/pyspark_runner.py:103`:

self.job = pickle.load(fd)

Pickle protocol configuration from `luigi/contrib/spark.py:297`:

return configuration.get_config().getint(
    'spark', 'pickle-protocol', pickle.DEFAULT_PROTOCOL)

py-packages distribution from `luigi/contrib/spark.py:286`:

packages = configuration.get_config().get('spark', 'py-packages', None)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment