Principle:Spotify Luigi Dependency Chaining

Overview

Dependency chaining is the technique of building a directed acyclic graph (DAG) of work by having each unit of work declare the other units it depends on.

Description

In a multi-step data pipeline, tasks rarely stand alone. A transformation task depends on an ingestion task; an aggregation task depends on several transformation tasks; a reporting task depends on the aggregation. These relationships form a directed acyclic graph where each edge means "must complete before."

Dependency chaining is the mechanism by which this graph is constructed declaratively: each task states its own upstream requirements, and the framework assembles the full graph by recursively traversing those declarations. The pipeline author never has to specify the global execution order -- it emerges automatically from the local dependency declarations.

This approach provides several benefits:

Modularity -- Each task only knows about its immediate upstream dependencies, not the entire pipeline topology.
Automatic ordering -- The scheduler resolves the full execution order by walking the dependency graph.
Incremental execution -- Only tasks whose outputs are missing are executed, along with any upstream dependencies whose outputs are also missing; tasks that are already complete are skipped.
Reusability -- The same task class can appear in multiple pipelines with different parameterizations.
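
The incremental-execution benefit can be sketched in plain Python. This is a toy model rather than Luigi's API: `Task`, `build`, `done`, and `ran` are hypothetical names, and a set of task names stands in for output files on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy task: `name` doubles as the identifier of its output."""
    name: str
    deps: list = field(default_factory=list)

    def requires(self):
        # Each task lists only its immediate upstream tasks.
        return self.deps


def build(task, done, ran):
    """Run `task` and any missing upstream work; skip what already exists."""
    if task.name in done:
        return  # output present: neither this task nor its upstream is touched
    for dep in task.requires():
        build(dep, done, ran)
    ran.append(task.name)  # stand-in for doing the real work
    done.add(task.name)    # stand-in for writing the output


ingest = Task("ingest")
transform = Task("transform", [ingest])
report = Task("report", [transform])

done = {"ingest"}  # pretend the ingestion output already exists
ran = []
build(report, done, ran)
print(ran)  # ['transform', 'report']
```

Because the ingestion output is assumed to exist, only the two downstream tasks run. Calling `build(report, done, ran)` a second time would run nothing.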

A critical challenge in dependency chaining is parameter propagation. When task C depends on task B, which depends on task A, the parameters of A often need to flow through B to reach C. Without tooling support, this leads to repetitive boilerplate (the "parameter explosion" problem). Good frameworks provide mechanisms -- such as parameter inheritance decorators or clone methods -- to propagate parameters along dependency chains without manual repetition.

Usage

Use dependency chaining when:

Your pipeline has multiple steps with clear data-flow relationships.
You want the execution framework to determine which tasks need to run based on missing outputs.
You need to reuse the same task definitions across different pipeline configurations.
You want to avoid centralized, brittle orchestration scripts that manually specify execution order.

Theoretical Basis

Dependency chaining constructs a DAG through recursive resolution:

FUNCTION resolve_dag(task, visited):
    IF task IN visited:
        RETURN  -- already processed; a shared dependency is visited only once

    visited.ADD(task)

    FOR EACH dependency IN task.requires():
        resolve_dag(dependency, visited)

    SCHEDULE(task)  -- all dependencies are now scheduled ahead of this task
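
A minimal Python sketch of the same recursion follows. It assumes the graph is given as a plain dictionary mapping each task name to the names it requires (the task names are illustrative), and it adds an explicit cycle check, which the pseudocode above does not perform.

```python
def resolve_dag(task, requires, scheduled, visiting=None):
    """Append `task` and its upstream tasks to `scheduled`, dependencies first."""
    visiting = set() if visiting is None else visiting
    if task in scheduled:
        return  # already scheduled via another path
    if task in visiting:
        raise ValueError(f"dependency cycle through {task!r}")
    visiting.add(task)
    for dep in requires.get(task, []):
        resolve_dag(dep, requires, scheduled, visiting)
    visiting.discard(task)
    scheduled.append(task)  # every dependency is already in the list


# Hypothetical pipeline: two transformations share one ingestion step.
requires = {
    "report": ["aggregate"],
    "aggregate": ["transform_a", "transform_b"],
    "transform_a": ["ingest"],
    "transform_b": ["ingest"],
}
order = []
resolve_dag("report", requires, order)
print(order)
# ['ingest', 'transform_a', 'transform_b', 'aggregate', 'report']
```

The shared `ingest` step appears once even though two tasks require it, and a graph such as `{"a": ["b"], "b": ["a"]}` raises `ValueError` instead of being silently truncated.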

The scheduler then executes tasks in topological order: a task only runs when all of its predecessors are complete. This is a direct application of topological sorting on a DAG.
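
The "runs only when all predecessors are complete" rule can be illustrated with `graphlib.TopologicalSorter` from the Python standard library (3.9+). This is a sketch of the ordering rule only, not of Luigi's scheduler; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the tasks it depends on (its predecessors).
requires = {
    "report": ["aggregate"],
    "aggregate": ["transform_a", "transform_b"],
    "transform_a": ["ingest"],
    "transform_b": ["ingest"],
}

sorter = TopologicalSorter(requires)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorter.get_ready()   # tasks whose predecessors are all done
    waves.append(sorted(ready))  # a real scheduler would run these here
    sorter.done(*ready)          # mark them complete, unblocking successors

print(waves)
# [['ingest'], ['transform_a', 'transform_b'], ['aggregate'], ['report']]
```

Tasks in the same wave have no dependency on each other, so they could run in parallel.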

For parameter propagation, the pattern relies on a clone operation:

FUNCTION clone(source_task, target_class):
    common_params = INTERSECTION(source_task.parameters, target_class.parameters)
    RETURN target_class(**common_params)

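This clone operation copies every parameter the two task classes share, so the downstream task does not have to pass each upstream parameter by hand. Combined with a mechanism that lets a task inherit parameter declarations from its upstream class, it addresses the parameter explosion problem.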
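
The clone pattern can be sketched in plain Python as follows. The `Task` base class, its `params` tuple, and the `Ingest` and `Report` classes are hypothetical stand-ins written for this example. Luigi provides its own `Task.clone` method built on a similar idea, but the code below does not use or reproduce it.

```python
class Task:
    """Toy base class; `params` lists the parameter names a class accepts."""
    params = ()

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self.params)
        if unknown:
            raise TypeError(f"unexpected parameters: {sorted(unknown)}")
        self.values = dict(kwargs)

    def clone(self, target_class, **overrides):
        # Copy only the parameters both classes declare, then apply overrides.
        common = {k: v for k, v in self.values.items()
                  if k in target_class.params}
        common.update(overrides)
        return target_class(**common)


class Ingest(Task):
    params = ("date", "region")


class Report(Task):
    params = ("date", "region", "fmt")

    def requires(self):
        # `fmt` is dropped; `date` and `region` carry over unchanged.
        return self.clone(Ingest)


report = Report(date="2024-01-01", region="eu", fmt="pdf")
upstream = report.requires()
print(type(upstream).__name__, upstream.values)
# Ingest {'date': '2024-01-01', 'region': 'eu'}
```

If `Ingest` later gains a parameter that `Report` also declares, `requires()` needs no change, which is the point of the pattern.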
