Principle:DataTalksClub Data engineering zoomcamp Pipeline Cleanup

From Leeroopedia


Metadata
Knowledge Sources Kestra Storage Documentation, Kestra Execution Lifecycle, Resource Management (Wikipedia)
Domains Resource Management, Pipeline Maintenance, Storage Optimization
Last Updated 2026-02-09 14:00 GMT

Overview

Resource cleanup after pipeline completion purges temporary files and intermediate artifacts to prevent storage accumulation across repeated pipeline executions.

Description

Orchestrated data pipelines generate intermediate artifacts during execution: downloaded source files, decompressed data, temporary working files, and task output references. These artifacts are stored in the orchestrator's internal storage system and persist beyond the lifetime of individual tasks. Without explicit cleanup, storage accumulation occurs as each pipeline execution adds new files without removing completed ones.

The resource cleanup principle addresses this by including a dedicated cleanup step at the end of every pipeline execution. This step:

  • Purges all output files from the current execution context, removing downloaded CSVs, intermediate transformations, and any other files registered as task outputs.
  • Operates at execution scope -- only files belonging to the current pipeline run are removed. Files from other executions or other flows remain untouched.
  • Runs unconditionally -- the cleanup step executes regardless of which conditional branches were taken during the pipeline, ensuring all possible artifacts are cleaned up.

The cleanup step is typically the final task in the pipeline's task list, positioned after all data has been successfully loaded into the target database. At that point, the intermediate files have served their purpose and are no longer needed.
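In Kestra, for example, this pattern can be sketched as a final purge task in the flow definition. This is a minimal sketch: the flow id, namespace, source URL, and task ids are illustrative, and the task type names assume the current Kestra core plugin naming.

```yaml
id: taxi_pipeline
namespace: zoomcamp

tasks:
  - id: extract
    type: io.kestra.plugin.core.http.Download
    uri: https://example.com/data.csv   # illustrative source URL

  # ... transform and load tasks run here, reading the downloaded file ...

  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: Remove downloaded and intermediate files from this execution's internal storage
```

Because the purge task is scoped to the current execution, it can safely run as the last task of every flow without affecting files produced by other executions.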

Usage

Use pipeline resource cleanup when:

  • The pipeline downloads or generates files that are stored in the orchestrator's internal storage.
  • The pipeline runs frequently (scheduled or triggered) and would accumulate significant storage over time.
  • Intermediate artifacts are not needed after the pipeline completes successfully.
  • Storage quotas or cost constraints require proactive management of temporary files.

Caution: Disable cleanup during development and debugging when you need to inspect intermediate files to diagnose pipeline issues.
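During debugging, the cleanup task can be temporarily skipped rather than deleted. In Kestra, for instance, tasks accept a `disabled` flag (sketch; the task id is illustrative):

```yaml
  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    disabled: true   # re-enable after debugging so storage does not accumulate
```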

Theoretical Basis

PIPELINE EXECUTION:
  Step 1: Extract   -> produces file_A.csv in internal storage
  Step 2: Transform -> reads file_A.csv, may produce intermediate files
  Step 3: Load      -> reads file_A.csv, loads into database
  Step 4: Cleanup   -> purges all files from this execution

CLEANUP LOGIC:
  FOR EACH file IN current_execution.output_files:
      DELETE file FROM orchestrator_internal_storage

POST-CONDITION:
  current_execution.output_files == empty
  database tables contain the loaded data (unaffected)
  other executions' files are unaffected

STORAGE IMPACT OVER TIME:
  Without cleanup: storage_used = N_executions * avg_artifact_size_per_execution
  With cleanup:    storage_used ~ 0 (only currently running executions hold files)
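The cleanup logic and post-conditions above can be simulated in a few lines of Python (a toy model of the orchestrator's internal storage, not Kestra's actual API):

```python
# Toy model: internal storage maps execution_id -> {filename: size_bytes}.
storage = {
    "exec_1": {"file_A.csv": 120_000, "file_A_clean.csv": 90_000},
    "exec_2": {"file_B.csv": 80_000},  # another execution's files
}

def purge_current_execution_files(storage, execution_id):
    """Delete every output file registered to one execution, leaving others intact."""
    for filename in list(storage.get(execution_id, {})):
        del storage[execution_id][filename]

purge_current_execution_files(storage, "exec_1")

assert storage["exec_1"] == {}                       # current execution's files are gone
assert storage["exec_2"] == {"file_B.csv": 80_000}   # other executions unaffected
```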

The cleanup step is a form of deterministic resource finalization -- it runs at a known point in the pipeline lifecycle (completion) and releases resources that are no longer needed. This is analogous to closing file handles or releasing memory in application programming, applied at the pipeline orchestration level.
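The analogy to deterministic finalization in application code can be made concrete: a `try/finally` block releases a resource at a known lifecycle point regardless of which branch executed, just as the cleanup task runs after every pipeline path. This Python sketch uses a temporary file as a stand-in for an intermediate artifact; the function and its branches are illustrative.

```python
import os
import tempfile

def run_pipeline(use_green_taxi: bool) -> str:
    # Intermediate artifact created at a known point in the lifecycle...
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        # ...used by whichever conditional branch runs...
        return "green" if use_green_taxi else "yellow"
    finally:
        # ...and released unconditionally when the pipeline completes.
        os.remove(path)

branch = run_pipeline(use_green_taxi=True)
```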
