Principle:Spotify Luigi Pipeline Parameterization

Overview

Pipeline parameterization is the practice of making pipeline steps configurable through typed, validated parameters that control which data slice is processed, how computation behaves, and where outputs are written.

Description

A useful pipeline is not hard-coded to a single dataset or configuration. Instead, it is parameterized so that the same pipeline logic can be reused across different dates, regions, thresholds, or any other axis of variation. Each unit of work in the pipeline carries a set of parameters that together define which specific instance of that work is being performed.

Pipeline parameterization involves several design decisions:

Typed parameters -- Each parameter has a declared type (string, integer, date, list, dictionary, etc.). The framework parses raw string inputs (from the command line or configuration files) into the correct Python type and validates them.
Default values and resolution order -- Parameters can have defaults, and the framework resolves values through a priority chain, highest priority first: explicit constructor arguments, command-line flags, configuration file entries, and finally the coded default. If no source supplies a value, resolution fails with an error.
Task identity -- Parameters marked as significant contribute to the unique identity of a task instance. Two task instances with different significant parameter values are considered different tasks. Insignificant parameters (e.g., credentials, log verbosity) do not affect identity.
Serialization round-trip -- Every parameter value can be serialized to a string and parsed back, enabling command-line invocation, scheduler communication, and task ID generation.
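
The typed-parameter and round-trip ideas above can be sketched with the standard library alone. This is a minimal illustration, not Luigi's implementation: the `Param` class, the `DATE`, `INT`, and `SECRET` instances, and `round_trip` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable


# Hypothetical stand-in for a framework parameter class (not Luigi's API).
@dataclass(frozen=True)
class Param:
    parse: Callable[[str], Any]       # raw string -> typed value
    serialize: Callable[[Any], str]   # typed value -> string
    significant: bool = True          # does it count toward task identity?


DATE = Param(parse=date.fromisoformat, serialize=date.isoformat)
INT = Param(parse=int, serialize=str)
SECRET = Param(parse=str, serialize=str, significant=False)


def round_trip(param: Param, raw: str) -> str:
    """Parse a raw string, then serialize the typed value back to a string."""
    return param.serialize(param.parse(raw))


# Parsing yields a datetime.date rather than a str; malformed input
# such as "not-a-date" raises ValueError instead of passing through.
run_date = DATE.parse("2024-03-01")
```

Keeping `parse` and `serialize` as a matched pair is what lets the same value travel between the command line, a scheduler, and a task identifier without losing its type.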

Well-designed parameterization enables:

Incremental pipelines -- Parameterizing by date allows daily runs to process only new data.
A/B experimentation -- Parameterizing by model version or feature flag lets parallel pipelines run with different configurations.
Reproducibility -- Every task instance is fully identified by its family name and significant parameter values, making runs traceable and repeatable.
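
As a rough sketch of the incremental case, a date parameter can drive both which runs are still needed and where each run writes its output. The helper names and the `reports/` path layout below are assumptions made for illustration; a real framework would typically decide completeness by checking whether each task's output already exists.

```python
from datetime import date, timedelta
from typing import List, Set


def output_path(run_date: date) -> str:
    """Hypothetical output location derived from the date parameter."""
    return f"reports/{run_date.isoformat()}.csv"


def pending_dates(start: date, end: date, done: Set[date]) -> List[date]:
    """Return the dates in [start, end] that have no completed run yet."""
    total_days = (end - start).days
    all_dates = [start + timedelta(days=i) for i in range(total_days + 1)]
    return [d for d in all_dates if d not in done]


# Two of four days are already processed, so only the last two remain.
already_done = {date(2024, 3, 1), date(2024, 3, 2)}
todo = pending_dates(date(2024, 3, 1), date(2024, 3, 4), already_done)
```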

Usage

Use pipeline parameterization when:

The same pipeline logic must process different data slices (e.g., daily batches, regional subsets).
You need to pass configuration values (thresholds, file paths, feature flags) into tasks in a validated, type-safe manner.
You want the scheduler to distinguish between task instances that operate on different data.
You need parameters to be settable from the command line, configuration files, or code.

Theoretical Basis

The parameter resolution algorithm follows a cascading priority chain:

FUNCTION resolve_parameter(task_family, param_name, param_definition):
    # Priority 1: Explicit value passed to constructor (already typed, so only validated)
    IF param_name IN constructor_kwargs:
        RETURN validate(constructor_kwargs[param_name])

    # Priority 2: Command-line argument (raw string)
    IF command_line HAS value FOR (task_family, param_name):
        RETURN parse_and_validate(command_line[task_family, param_name])

    # Priority 3: Configuration file (raw string)
    IF config_file HAS section task_family WITH key param_name:
        RETURN parse_and_validate(config_file[task_family][param_name])

    # Priority 4: Coded default
    IF param_definition HAS default:
        RETURN param_definition.default

    RAISE MissingParameterException
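
The same cascade can be written as a small runnable Python function. This is a sketch under stated assumptions rather than Luigi's code: the sources are modeled as plain dictionaries, explicit constructor values are assumed to be typed already and are returned as-is (validation is omitted for brevity), and `MissingParameterError` and `_MISSING` are hypothetical names.

```python
from typing import Any, Callable, Mapping, Tuple

_MISSING = object()  # sentinel: distinguishes "no default" from a default of None


class MissingParameterError(Exception):
    """Raised when no source supplies a value (hypothetical name)."""


def resolve_parameter(
    task_family: str,
    param_name: str,
    parse: Callable[[str], Any],
    *,
    constructor_kwargs: Mapping[str, Any],
    command_line: Mapping[Tuple[str, str], str],
    config: Mapping[str, Mapping[str, str]],
    default: Any = _MISSING,
) -> Any:
    # Priority 1: explicit constructor value, assumed to be typed already.
    if param_name in constructor_kwargs:
        return constructor_kwargs[param_name]
    # Priority 2: raw string from the command line.
    key = (task_family, param_name)
    if key in command_line:
        return parse(command_line[key])
    # Priority 3: raw string from the config section named after the family.
    section = config.get(task_family, {})
    if param_name in section:
        return parse(section[param_name])
    # Priority 4: coded default.
    if default is not _MISSING:
        return default
    raise MissingParameterError(f"{task_family}.{param_name} has no value")


cli = {("DailyReport", "threshold"): "5"}
cfg = {"DailyReport": {"threshold": "7", "region": "eu"}}
# The command line outranks the config file, so "5" wins over "7".
threshold = resolve_parameter(
    "DailyReport", "threshold", int,
    constructor_kwargs={}, command_line=cli, config=cfg, default=1,
)
```

A sentinel is used for the missing default so that `None` remains a legitimate default value.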

The task identity function combines the task family name with the serialized values of all significant parameters:

FUNCTION task_id(task):
    family = task.get_task_family()
    significant_params = {name: serialize(value)
                          FOR (name, value) IN task.params
                          IF param_is_significant(name)}
    RETURN hash(family, significant_params)

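This identity function ensures that two task instances are treated as the same task exactly when they share the same family and significant parameter values (barring hash collisions) -- a property essential for idempotent scheduling, since requesting the same task twice must not schedule it twice.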
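
A minimal Python sketch of such an identity function follows. The canonical JSON encoding, the SHA-256 digest, the 10-character truncation, and the `family_digest` format are all assumptions chosen for illustration; they are not claimed to match the identifier format any particular framework produces.

```python
import hashlib
import json
from typing import Mapping, Set


def task_id(family: str, params: Mapping[str, str], significant: Set[str]) -> str:
    """Build an identifier from the family and significant serialized params."""
    # Drop insignificant parameters so they cannot influence identity.
    sig = {name: value for name, value in params.items() if name in significant}
    # Sorting keys makes the encoding independent of dictionary ordering.
    canonical = json.dumps(sig, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]
    return f"{family}_{digest}"


significant = {"date", "region"}
a = task_id("DailyReport", {"date": "2024-03-01", "region": "eu", "password": "x"}, significant)
b = task_id("DailyReport", {"region": "eu", "date": "2024-03-01", "password": "y"}, significant)
c = task_id("DailyReport", {"date": "2024-03-02", "region": "eu", "password": "x"}, significant)
# a and b differ only in an insignificant parameter and in key order,
# so they name the same task; c changes a significant value and does not.
```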
