Principle:Spotify Luigi Dependency Chaining
Overview
Dependency chaining is the technique of building a directed acyclic graph (DAG) of work by having each unit of work declare the other units it depends on.
Description
In a multi-step data pipeline, tasks rarely stand alone. A transformation task depends on an ingestion task; an aggregation task depends on several transformation tasks; a reporting task depends on the aggregation. These relationships form a directed acyclic graph where each edge means "must complete before."
Dependency chaining is the mechanism by which this graph is constructed declaratively: each task states its own upstream requirements, and the framework assembles the full graph by recursively traversing those declarations. The pipeline author never has to specify the global execution order -- it emerges automatically from the local dependency declarations.
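As a rough illustration of how local declarations add up to a global graph, the following sketch uses plain Python rather than Luigi itself. The `Task` base class, the four task classes, and the `collect_edges` helper are hypothetical names chosen for this example; the only assumption is that each task exposes a `requires()` method returning its direct upstream tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical base class: a task only declares its direct upstream tasks."""

    def requires(self):
        return []


@dataclass(frozen=True)
class Ingest(Task):
    source: str


@dataclass(frozen=True)
class Transform(Task):
    source: str

    def requires(self):
        return [Ingest(self.source)]


@dataclass(frozen=True)
class Aggregate(Task):
    sources: tuple

    def requires(self):
        return [Transform(s) for s in self.sources]


@dataclass(frozen=True)
class Report(Task):
    sources: tuple

    def requires(self):
        return [Aggregate(self.sources)]


def collect_edges(task, edges=None):
    """Recursively follow requires() and record (upstream, downstream) pairs."""
    if edges is None:
        edges = set()
    for dep in task.requires():
        edge = (dep, task)
        if edge not in edges:
            edges.add(edge)
            collect_edges(dep, edges)
    return edges


# Only the final task is named; the rest of the graph is discovered from it.
edges = collect_edges(Report(sources=("eu", "us")))
for upstream, downstream in sorted(edges, key=repr):
    print(f"{upstream!r} -> {downstream!r}")
```

No class in this sketch mentions the overall topology: `Report` knows only about `Aggregate`, yet asking for a single `Report` is enough to recover every edge.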
This approach provides several benefits:
- Modularity -- Each task only knows about its immediate upstream dependencies, not the entire pipeline topology.
- Automatic ordering -- The scheduler resolves the full execution order by walking the dependency graph.
- Incremental execution -- Only tasks whose outputs are missing are executed, together with any of their transitive dependencies whose outputs are also missing.
- Reusability -- The same task class can appear in multiple pipelines with different parameterizations.
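The incremental-execution benefit can be sketched as follows. This is a simplified model, not Luigi's scheduler: the `tasks_to_run` function, the `requires` dictionary, and the `existing_outputs` set are hypothetical stand-ins for task declarations and for a check that a task's output already exists.

```python
def tasks_to_run(task, requires, existing_outputs, plan=None):
    """Return incomplete tasks in dependency order, skipping finished subtrees.

    requires: dict mapping a task name to the names it depends on (hypothetical).
    existing_outputs: set of task names whose output already exists.
    """
    if plan is None:
        plan = []
    if task in existing_outputs or task in plan:
        return plan  # complete or already planned: nothing more to do
    for dep in requires.get(task, []):
        tasks_to_run(dep, requires, existing_outputs, plan)
    plan.append(task)
    return plan


requires = {
    "report": ["aggregate"],
    "aggregate": ["transform_a", "transform_b"],
    "transform_a": ["ingest_a"],
    "transform_b": ["ingest_b"],
}

# transform_a already produced its output, so neither it nor ingest_a is re-run.
plan = tasks_to_run("report", requires, existing_outputs={"transform_a"})
print(plan)  # ['ingest_b', 'transform_b', 'aggregate', 'report']
```

In this model a completed task also shields everything upstream of it, which is why `ingest_a` never appears in the plan.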
A critical challenge in dependency chaining is parameter propagation. When task C depends on task B, which depends on task A, the parameters of A often need to flow through B to reach C. Without tooling support, this leads to repetitive boilerplate (the "parameter explosion" problem). Good frameworks provide mechanisms -- such as parameter inheritance decorators or clone methods -- to propagate parameters along dependency chains without manual repetition.
Usage
Use dependency chaining when:
- Your pipeline has multiple steps with clear data-flow relationships.
- You want the execution framework to determine which tasks need to run based on missing outputs.
- You need to reuse the same task definitions across different pipeline configurations.
- You want to avoid centralized, brittle orchestration scripts that manually specify execution order.
Theoretical Basis
Dependency chaining constructs a DAG through recursive resolution:
FUNCTION resolve_dag(task, visited):
    IF task IN visited:
        RETURN  -- already processed, no cycles allowed
    visited.ADD(task)
    FOR EACH dependency IN task.requires():
        resolve_dag(dependency, visited)
    SCHEDULE(task)  -- all dependencies are now scheduled ahead of this task
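The scheduler then executes tasks in topological order: a task only runs when all of its predecessors are complete. Because resolve_dag schedules a task only after recursing into everything it requires, the resulting order is a depth-first, post-order traversal, which is one valid topological sort of the DAG.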
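A runnable Python rendering of the resolution pseudocode might look like the sketch below. The `schedule_order` function and the dictionary-based `requires` mapping are assumptions made for this example. The explicit cycle check via `in_progress` is an addition: the pseudocode marks a task as visited before recursing and would simply stop on a cycle rather than report it.

```python
def schedule_order(root, requires):
    """Return tasks in an order where every dependency precedes its dependents."""
    visited = set()      # tasks that are fully processed
    in_progress = set()  # tasks on the current recursion path (cycle check)
    order = []

    def resolve(task):
        if task in visited:
            return  # already scheduled via another path
        if task in in_progress:
            raise ValueError(f"dependency cycle detected at {task!r}")
        in_progress.add(task)
        for dependency in requires.get(task, []):
            resolve(dependency)
        in_progress.remove(task)
        visited.add(task)
        order.append(task)  # SCHEDULE(task): its dependencies are already listed

    resolve(root)
    return order


# A diamond: both transforms share one ingestion task, which is scheduled once.
diamond = {
    "report": ["aggregate"],
    "aggregate": ["transform_1", "transform_2"],
    "transform_1": ["ingest"],
    "transform_2": ["ingest"],
}
order = schedule_order("report", diamond)
print(order)  # ['ingest', 'transform_1', 'transform_2', 'aggregate', 'report']
```

The shared `ingest` task appears once even though two tasks require it, which is the role the visited set plays in the pseudocode.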
For parameter propagation, the pattern relies on a clone operation:
FUNCTION clone(source_task, target_class):
    common_params = INTERSECTION(source_task.parameters, target_class.parameters)
    RETURN target_class(**common_params)
This clone operation copies the parameters that both task classes declare, matched by name, so the downstream task does not have to pass each upstream parameter by hand. This addresses the parameter explosion problem.
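The clone pseudocode can be approximated in plain Python with dataclasses standing in for parameterized tasks. This is a sketch of the idea only and does not use Luigi's own clone method or decorators. `DailyIngest`, `DailyTransform`, their fields, and the `overrides` keyword arguments are hypothetical and introduced for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DailyIngest:
    date: str
    region: str


@dataclass(frozen=True)
class DailyTransform:
    date: str
    region: str
    fmt: str = "parquet"  # not declared by DailyIngest, so never passed upstream

    def requires(self):
        # One call replaces DailyIngest(date=self.date, region=self.region).
        return clone(self, DailyIngest)


def clone(source_task, target_class, **overrides):
    """Build target_class from the parameters it shares by name with source_task."""
    source_params = {f.name: getattr(source_task, f.name) for f in fields(source_task)}
    target_names = {f.name for f in fields(target_class)}
    common = {k: v for k, v in source_params.items() if k in target_names}
    common.update(overrides)  # explicit values win over inherited ones
    return target_class(**common)


transform = DailyTransform(date="2024-01-01", region="eu")
print(transform.requires())
```

If a new shared parameter is later added to both classes, `requires()` needs no change, because the intersection is computed when `clone` is called.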