Principle:Spotify Luigi Task Definition
Overview
Task definition is the practice of encapsulating a single, atomic unit of work within a pipeline as a self-describing object that declares its inputs, computation, and outputs.
Description
A data pipeline is composed of discrete steps, each of which transforms some input into some output. The task definition pattern formalizes each step as an object with a well-defined lifecycle:
- Declare dependencies -- The task states which other tasks (or external data sources) must be complete before it can execute.
- Execute computation -- The task performs its work: reading inputs, transforming data, and producing results.
- Publish output -- The task writes its results to a target location (a file, database table, etc.).
- Report completeness -- The task can answer the question "am I already done?" by checking whether its outputs exist.
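The four phases above map naturally onto four methods. The following is a minimal sketch that uses only the Python standard library and borrows Luigi's method names (`requires`, `run`, `output`, `complete`). The `Target`, `Task`, `RawNumbers`, and `SumNumbers` classes are hypothetical stand-ins written for illustration, not Luigi's own classes.

```python
import os
import tempfile


class Target:
    """Hypothetical file-backed output location."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Hypothetical base class exposing the four lifecycle phases."""

    def requires(self):
        # Declare dependencies (none by default).
        return []

    def output(self):
        # Name the target that run() publishes to.
        raise NotImplementedError

    def run(self):
        # Execute the computation and publish the result to output().
        raise NotImplementedError

    def complete(self):
        # Report completeness: done when the declared output exists.
        return self.output().exists()


class RawNumbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return Target(os.path.join(self.workdir, "raw.txt"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("1\n2\n3\n")


class SumNumbers(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [RawNumbers(self.workdir)]

    def output(self):
        return Target(os.path.join(self.workdir, "sum.txt"))

    def run(self):
        (upstream,) = self.requires()
        with open(upstream.output().path) as f:
            total = sum(int(line) for line in f)
        with open(self.output().path, "w") as f:
            f.write(str(total))


workdir = tempfile.mkdtemp()
raw, total = RawNumbers(workdir), SumNumbers(workdir)
status_before = (raw.complete(), total.complete())
raw.run()    # dependencies are run by hand here; a scheduler would do this
total.run()
status_after = (raw.complete(), total.complete())
```

Note that `complete()` never inspects how the output was produced. It only checks that the declared target exists, which is what lets a scheduler skip finished work.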
This four-phase lifecycle makes each pipeline step idempotent and, when outputs are written atomically, all-or-nothing. If a task's output already exists, the task is not re-executed. If a task fails partway through, downstream consumers do not see partial output, provided the task publishes through an atomic write pattern (for example, writing to a temporary location and renaming it into place on success). And because each task explicitly names its dependencies and outputs, the framework can assemble the full execution graph automatically.
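The atomic write pattern mentioned above can be sketched as a write-then-rename helper. This is a generic standard-library illustration, not Luigi's implementation; the `atomic_write` function and the `good_lines` and `failing_lines` producers are hypothetical names introduced for the example.

```python
import os
import tempfile


def atomic_write(path, produce_lines):
    """Write to a temporary file, then rename it into place on success."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for line in produce_lines():
                f.write(line + "\n")
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial file so it never appears at the final path.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def good_lines():
    yield "a"
    yield "b"


def failing_lines():
    yield "a"
    raise RuntimeError("simulated crash partway through")


workdir = tempfile.mkdtemp()
ok_path = os.path.join(workdir, "ok.txt")
bad_path = os.path.join(workdir, "bad.txt")

atomic_write(ok_path, good_lines)

try:
    atomic_write(bad_path, failing_lines)
except RuntimeError:
    crashed = True
else:
    crashed = False
```

Because the final path only ever appears through the rename, an existence-based completeness check cannot mistake a half-written file for a finished output.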
The key insight is that a task is not simply a function -- it is a stateful contract. It carries parameters that identify which specific slice of data it operates on, and it exposes a standard interface that the scheduler can interrogate without running the task itself.
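To illustrate how parameters give a task its identity, the sketch below models a task as a frozen dataclass. Luigi declares parameters with its own parameter classes; the `DailyReport` class, its `day` and `region` fields, and the `output_path` method are hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DailyReport:
    """Hypothetical task whose parameters identify one slice of data."""

    day: date
    region: str = "eu"

    def output_path(self):
        # The output location is derived entirely from the parameters.
        return f"reports/{self.region}/{self.day.isoformat()}.csv"


a = DailyReport(day=date(2024, 1, 1))
b = DailyReport(day=date(2024, 1, 1))
c = DailyReport(day=date(2024, 1, 2), region="us")

# Equal parameters mean the same task, so a scheduler can deduplicate them.
unique_tasks = {a, b, c}

# The contract can be inspected without running anything.
planned_outputs = sorted(t.output_path() for t in unique_tasks)
```

Here `a` and `b` compare equal and collapse to one entry in the set, and the planned output paths are available before any work has been done.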
Usage
Use the task definition pattern whenever:
- You need to break a pipeline into individually schedulable, restartable units.
- Each step should be idempotent: safe to re-run without producing duplicate results.
- You want the framework to automatically determine which steps need to run based on the presence or absence of outputs.
- You need to parameterize pipeline steps (e.g., by date, region, or configuration variant).
Theoretical Basis
The task lifecycle can be described as a simple recursive procedure:
FUNCTION execute_task(task):
    IF task.complete():
        RETURN ALREADY_DONE
    FOR EACH dependency IN task.requires():
        execute_task(dependency)
    task.run()
    IF task.complete():
        RETURN SUCCESS
    ELSE:
        RETURN FAILURE  -- output was not created as expected
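A runnable version of the procedure above might look like the following sketch. It is a simplified, single-process model and not how Luigi's scheduler and workers are implemented. `InMemoryTask`, its `store` dictionary, and the `log` list are hypothetical helpers, and the sketch deviates from the pseudocode in one respect: it stops early when a dependency fails.

```python
SUCCESS, FAILURE, ALREADY_DONE = "SUCCESS", "FAILURE", "ALREADY_DONE"


class InMemoryTask:
    """Hypothetical task whose 'output' is a key in a shared dict."""

    def __init__(self, name, store, deps=(), broken=False):
        self.name = name
        self.store = store
        self.deps = list(deps)
        self.broken = broken

    def requires(self):
        return self.deps

    def complete(self):
        return self.name in self.store

    def run(self):
        # A broken task "runs" but never produces its output.
        if not self.broken:
            self.store[self.name] = f"result of {self.name}"


def execute_task(task, log):
    if task.complete():
        return ALREADY_DONE
    for dependency in task.requires():
        # Deviation from the pseudocode: abort if a dependency fails.
        if execute_task(dependency, log) == FAILURE:
            return FAILURE
    task.run()
    log.append(task.name)  # record the order in which tasks actually ran
    return SUCCESS if task.complete() else FAILURE


store, log = {}, []
extract = InMemoryTask("extract", store)
transform = InMemoryTask("transform", store, deps=[extract])
load = InMemoryTask("load", store, deps=[transform])

first = execute_task(load, log)    # runs the whole chain
second = execute_task(load, log)   # nothing left to do

bad = InMemoryTask("bad", {}, broken=True)
third = execute_task(bad, [])      # run() leaves no output behind
```

The second call returns immediately because the completeness check comes first, which is the idempotence property described earlier.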
The default completeness check relies on the output contract:
FUNCTION complete(task):
    outputs = task.output()
    IF outputs IS EMPTY:
        RETURN False
    RETURN ALL(output.exists() FOR output IN outputs)
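The completeness check above can be sketched in plain Python as follows. The `flatten` helper here is a simplified stand-in that assumes outputs may be declared as nothing, a single target, a list, or a dict. `MemoryTarget` and its `existing` set are hypothetical names used only to make the example self-contained.

```python
class MemoryTarget:
    """Hypothetical target backed by a set of names that 'exist'."""

    def __init__(self, name, existing):
        self.name = name
        self.existing = existing

    def exists(self):
        return self.name in self.existing


def flatten(outputs):
    """Normalize None, one target, a list/tuple, or a dict to a flat list."""
    if outputs is None:
        return []
    if isinstance(outputs, dict):
        outputs = list(outputs.values())
    if isinstance(outputs, (list, tuple)):
        flat = []
        for item in outputs:
            flat.extend(flatten(item))
        return flat
    return [outputs]


def complete(outputs):
    targets = flatten(outputs)
    if not targets:
        # No declared outputs: completeness cannot be inferred, so say False.
        return False
    return all(target.exists() for target in targets)


existing = {"a.csv"}
a = MemoryTarget("a.csv", existing)
b = MemoryTarget("b.csv", existing)

results = {
    "single existing": complete(a),
    "one missing": complete([a, b]),
    "dict of outputs": complete({"x": a, "y": b}),
    "no outputs": complete([]),
}
existing.add("b.csv")
results["after b appears"] = complete([a, b])
```

One consequence of the empty-output rule is that a task declaring no outputs never reports itself complete under this default, so it needs its own completeness check if it should be skipped on later runs.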
This design follows the Hollywood Principle ("don't call us, we'll call you"): the task author defines what to do, and the framework decides when and whether to do it. It also embodies the Command Pattern from object-oriented design, where each task is a reified command object that can be queued, inspected, and executed by an external invoker (the scheduler/worker).