Principle: DataTalksClub Data Engineering Zoomcamp Pipeline Cleanup
| Metadata | |
|---|---|
| Knowledge Sources | Kestra Storage Documentation, Kestra Execution Lifecycle, Resource Management (Wikipedia) |
| Domains | Resource Management, Pipeline Maintenance, Storage Optimization |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Resource cleanup after pipeline completion purges temporary files and intermediate artifacts to prevent storage accumulation across repeated pipeline executions.
Description
Orchestrated data pipelines generate intermediate artifacts during execution: downloaded source files, decompressed data, temporary working files, and task output references. These artifacts are stored in the orchestrator's internal storage system and persist beyond the lifetime of individual tasks. Without explicit cleanup, storage grows monotonically: each pipeline execution adds new files, while files left behind by completed runs are never removed.
The resource cleanup principle addresses this by including a dedicated cleanup step at the end of every pipeline execution. This step:
- Purges all output files from the current execution context, removing downloaded CSVs, intermediate transformations, and any other files registered as task outputs.
- Operates at execution scope -- only files belonging to the current pipeline run are removed. Files from other executions or other flows remain untouched.
- Runs unconditionally -- the cleanup step executes regardless of which conditional branches were taken during the pipeline, ensuring all possible artifacts are cleaned up.
The cleanup step is typically the final task in the pipeline's task list, positioned after all data has been successfully loaded into the target database. At that point, the intermediate files have served their purpose and are no longer needed.
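In Kestra terms, this can be sketched as a flow whose final task purges the current execution's stored files. A minimal illustration, assuming Kestra's `io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles` task; the flow id, namespace, task ids, and source URL are placeholders, not the zoomcamp's actual flow:

```yaml
id: etl_with_cleanup
namespace: zoomcamp            # illustrative namespace

tasks:
  - id: extract
    type: io.kestra.plugin.core.http.Download
    uri: https://example.com/source.csv   # placeholder source URL

  # ... transform and load tasks would go here ...

  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: Remove this execution's stored files once loading succeeds
```

Because the purge task is last in the task list, it only runs after the load has completed, and it touches only files registered as outputs of this execution.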
Usage
Use pipeline resource cleanup when:
- The pipeline downloads or generates files that are stored in the orchestrator's internal storage.
- The pipeline runs frequently (scheduled or triggered) and would accumulate significant storage over time.
- Intermediate artifacts are not needed after the pipeline completes successfully.
- Storage quotas or cost constraints require proactive management of temporary files.
Caution: Disable cleanup during development and debugging when you need to inspect intermediate files to diagnose pipeline issues.
Theoretical Basis
```
PIPELINE EXECUTION:
  Step 1: Extract   -> produces file_A.csv in internal storage
  Step 2: Transform -> reads file_A.csv, may produce intermediate files
  Step 3: Load      -> reads file_A.csv, loads into database
  Step 4: Cleanup   -> purges all files from this execution

CLEANUP LOGIC:
  FOR EACH file IN current_execution.output_files:
      DELETE file FROM orchestrator_internal_storage

POST-CONDITION:
  current_execution.output_files == empty
  database tables contain the loaded data (unaffected)
  other executions' files are unaffected

STORAGE IMPACT OVER TIME:
  Without cleanup: storage_used = N_executions * avg_artifact_bytes_per_execution
  With cleanup:    storage_used ~ 0 (only in-flight executions hold files)
```
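The cleanup logic and its storage impact can be simulated with a small, self-contained model. This is a hypothetical in-memory sketch (`InternalStorage` and `run_pipeline` are illustrative names, not Kestra APIs) showing how execution-scoped purging keeps storage flat across repeated runs:

```python
class InternalStorage:
    """Stand-in for the orchestrator's internal storage system."""

    def __init__(self):
        self.files = {}  # file name -> size in bytes

    def put(self, name, size):
        self.files[name] = size

    def purge(self, names):
        # Execution-scoped cleanup: remove only the named files,
        # leaving other executions' files untouched.
        for name in names:
            self.files.pop(name, None)

    def used(self):
        return sum(self.files.values())


def run_pipeline(storage, execution_id, cleanup=True):
    # Each execution registers its own output files under its own id.
    outputs = [f"{execution_id}/file_A.csv"]
    for name in outputs:
        storage.put(name, 100)  # pretend each artifact is 100 bytes
    # ... transform and load into the database would happen here ...
    if cleanup:
        storage.purge(outputs)  # final step: purge this execution's files


storage = InternalStorage()
for i in range(50):
    run_pipeline(storage, f"exec-{i}", cleanup=False)
print(storage.used())  # without cleanup: 50 executions * 100 bytes = 5000

storage = InternalStorage()
for i in range(50):
    run_pipeline(storage, f"exec-{i}", cleanup=True)
print(storage.used())  # with cleanup: 0
```

The simulation mirrors the formulas above: without cleanup, usage grows linearly with the number of executions; with cleanup, each run releases its artifacts on completion, so only in-flight executions ever hold storage.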
The cleanup step is a form of deterministic resource finalization -- it runs at a known point in the pipeline lifecycle (completion) and releases resources that are no longer needed. This is analogous to closing file handles or releasing memory in application programming, applied at the pipeline orchestration level.