Principle:Spotify Luigi External Data Sources
Overview
External data sources are pre-existing data assets that a pipeline declares as inputs without being responsible for producing them.
Description
In any data pipeline, not every piece of data is generated by the pipeline itself. Some inputs come from outside the pipeline's control: files placed on disk by another system, tables populated by an upstream ETL job, or datasets delivered by a third party. Rather than silently assuming these inputs exist, a well-designed pipeline explicitly declares each external dependency as a distinct unit within its workflow graph.
An external data source declaration serves three purposes:
- Visibility -- Every input is registered in the pipeline's dependency graph, making data lineage traceable.
- Completeness checking -- Before downstream work begins, the framework can verify that the external data actually exists, preventing wasted computation on missing inputs.
- Separation of concerns -- The pipeline author distinguishes between "work I must perform" and "data I expect to find," keeping each unit of work focused on a single responsibility.
Conceptually, an external data source is a no-op task -- a node in the execution graph that has an output but no run logic. Its sole contract is: "this data should already exist; if it does not, the pipeline should report the gap rather than attempt to create it."
Usage
Use the external data source pattern whenever:
- A file or dataset is produced by a process outside the current pipeline (e.g., a daily data dump from an upstream service).
- You need the scheduler to verify that prerequisite data is present before allowing dependent tasks to execute.
- You want to document and track every input your pipeline depends on, even those not created by the pipeline itself.
Theoretical Basis
The logic for handling external data sources follows a straightforward completeness-check algorithm:
FUNCTION check_external_source(source):
    IF source.output_exists():
        RETURN COMPLETE
    ELSE:
        RETURN INCOMPLETE  -- signal the scheduler; do NOT attempt to produce the data
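The pseudocode above could be rendered in plain Python roughly as follows. This is a framework-free sketch: `Status` and `ExternalSource` are hypothetical names, and a local file path stands in for whatever storage the real source uses.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class Status(Enum):
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"


@dataclass(frozen=True)
class ExternalSource:
    """Hypothetical external source backed by a local file (an assumption)."""

    path: Path

    def output_exists(self) -> bool:
        return self.path.exists()


def check_external_source(source: ExternalSource) -> Status:
    # Only report what is found; never try to create the missing data.
    if source.output_exists():
        return Status.COMPLETE
    return Status.INCOMPLETE
```

The function has no side effects, which matches the contract: an incomplete source is a signal to the caller, not a trigger for work.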
Because no computation step is defined, the scheduler treats the source as a leaf node in the directed acyclic graph (DAG): it has nothing to run and can only be checked for completeness. If the source is incomplete, every task that depends on it, directly or transitively, stays blocked, and the scheduler may retry the completeness check at a later time.
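To illustrate how one missing leaf blocks everything downstream of it, here is a small sketch that does not use any scheduler API. The function `find_blocked`, the `requires` mapping, and the task names are hypothetical; a real scheduler tracks considerably more state than this.

```python
def find_blocked(requires: dict[str, list[str]], missing: set[str]) -> set[str]:
    """Return tasks that depend, directly or transitively, on a missing source.

    `requires` maps each task to the names it depends on; external sources
    appear only as dependencies. The graph is assumed to be acyclic.
    """
    memo: dict[str, bool] = {}

    def blocked(task: str) -> bool:
        if task not in memo:
            memo[task] = any(
                dep in missing or blocked(dep)
                for dep in requires.get(task, [])
            )
        return memo[task]

    return {task for task in requires if blocked(task)}


# Hypothetical pipeline: raw_dump and lookup are external sources.
pipeline = {
    "clean": ["raw_dump"],
    "report": ["clean", "lookup"],
    "audit": ["lookup"],
}
blocked_now = find_blocked(pipeline, missing={"raw_dump"})
```

In this example `clean` and `report` are blocked because both trace back to `raw_dump`, while `audit` remains runnable since it depends only on `lookup`.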
This pattern can be read as an application of the Dependency Inversion Principle at the data level: rather than hard-coding file paths deep inside task logic, the pipeline declares its external data needs at the boundary of the DAG, where they are explicit and can be checked or substituted in tests.
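The testability point can be sketched as follows, again without any framework. The `DataSource` protocol, `InMemorySource`, and `count_records` are hypothetical names introduced for this example; the idea is only that task logic receives its input as a declared dependency instead of reaching for a fixed path.

```python
import io
from typing import Optional, Protocol, TextIO


class DataSource(Protocol):
    """Hypothetical boundary interface for an external input."""

    def exists(self) -> bool: ...

    def open(self) -> TextIO: ...


class InMemorySource:
    """Test double: behaves like an external source without touching disk."""

    def __init__(self, text: Optional[str] = None):
        self._text = text

    def exists(self) -> bool:
        return self._text is not None

    def open(self) -> TextIO:
        return io.StringIO(self._text)


def count_records(source: DataSource) -> int:
    # The task logic never names a path; it only uses what was declared.
    if not source.exists():
        raise FileNotFoundError("external source is missing; not creating it")
    with source.open() as handle:
        return sum(1 for line in handle if line.strip())
```

Because `count_records` depends only on the interface, the same logic can be exercised against an in-memory stand-in in tests and against a real file or table in production.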