Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Spotify Luigi Spark Job Definition

From Leeroopedia


Template:Knowledge Sources Template:Domains Template:Last Updated

Overview

Spark Job Definition is the practice of encapsulating an Apache Spark application -- whether it is an external JAR/script or an inline Python function -- as a declarative task within a pipeline orchestrator, so that it participates in dependency resolution and scheduling like any other pipeline step.

Description

A pipeline orchestrator manages many kinds of tasks: database queries, file transfers, API calls, and distributed compute jobs. Spark jobs are among the most resource-intensive steps in a data pipeline, and they require a distinct submission mechanism (the spark-submit launcher). Spark Job Definition is the design pattern that bridges these two worlds by expressing a Spark application as a first-class task object with:

  • An application artifact -- either an external file (a compiled .jar for JVM applications or a .py script for PySpark) or an inline Python method that contains the Spark logic.
  • Application options -- a list of positional arguments passed to the Spark application's main entry point. These typically include input and output paths derived from the task's declared dependencies and targets.
  • Dependency and output declarations -- standard requires() and output() methods that let the scheduler resolve the task's position in the overall DAG.
  • An entry-point class (JVM only) -- the fully qualified class name whose main method the JVM invokes.

There are two fundamental styles of Spark Job Definition:

  1. External application -- The Spark logic lives in a separately compiled or written artifact. The orchestrator only needs to know the artifact path and the arguments to pass. This is appropriate for large, tested codebases where the Spark application has its own build lifecycle.
  2. Inline application -- The Spark logic is written directly inside the orchestrator task class as a Python method. The orchestrator serialises the task instance (including its parameters), ships the serialised blob to the Spark driver, and the driver deserialises and invokes it. This is convenient for small, self-contained PySpark jobs that benefit from tight coupling with the orchestrator's parameter system.

Usage

Use Spark Job Definition when:

  • You need a Spark job to participate in a larger dependency DAG managed by a pipeline orchestrator.
  • You want to pass pipeline parameters (dates, paths, thresholds) through to the Spark application's arguments automatically.
  • You have either a pre-built JAR / Python script, or you prefer to write PySpark logic inline within the orchestrator's task class.
  • You need the orchestrator to skip the Spark job when its output target already exists.

Theoretical Basis

Spark Job Definition applies the Template Method pattern:

  1. A base class provides the skeleton of the Spark submission process: resolve configuration, build the spark-submit command line, append the application artifact and its arguments, then invoke the process.
  2. Subclasses fill in the variable parts: which application to run (app), what arguments to pass (app_options()), and what constitutes task completion (output()).

For inline PySpark jobs, an additional mechanism is layered on top:

  1. The orchestrator pickles (serialises) the task instance to a temporary file on disk.
  2. The spark-submit command is directed at a generic runner script (pyspark_runner.py) with the pickle file as its first argument.
  3. On the driver, the runner deserialises the task instance, creates a SparkContext (or SparkSession), and calls the task's main() method.
  4. Any Python packages declared as dependencies are compressed, uploaded, and added to the Spark context via sc.addPyFile().

This two-layer design means that a single orchestrator class hierarchy can handle both external-artifact and inline-Python styles of Spark job definition with the same scheduling, dependency resolution, and configuration infrastructure.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment