Principle: DataTalksClub Data Engineering Zoomcamp - Kestra Data Extraction
| Metadata | |
|---|---|
| Knowledge Sources | Kestra Tasks Documentation, Kestra Inputs Documentation, GNU Wget Manual |
| Domains | Data Extraction, ETL, Shell Scripting, Workflow Orchestration |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Data extraction from remote sources can be implemented with shell commands inside orchestrated workflows: files are downloaded and decompressed as managed tasks whose outputs are captured by the orchestrator.
Description
The extraction phase of an ETL pipeline retrieves raw data from external sources and prepares it for downstream processing. When data is hosted as compressed flat files on remote servers, shell-based extraction provides a direct and efficient mechanism: a shell command downloads the file over HTTP and pipes the compressed stream through a decompression utility, producing a ready-to-process file in a single operation.
Within an orchestration platform, this shell command runs as a managed task rather than an ad-hoc script. The orchestrator:
- Parameterizes the command -- input variables such as data type, year, and month are injected into the URL and filename using template expressions, making the extraction step reusable across different datasets.
- Captures output files -- the orchestrator registers the produced files (matched via glob patterns like *.csv) as task outputs, making them available to subsequent pipeline steps through internal storage references.
- Manages execution context -- the task runner controls the process lifecycle, captures exit codes, and integrates with the orchestrator's logging and retry mechanisms.
This approach avoids the need for custom extraction code or dedicated connector plugins when the data source provides straightforward HTTP file downloads.
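The parameterized extraction described above can be sketched as a standalone shell script. This is a hedged illustration, not the orchestrator's actual task definition: positional arguments stand in for the orchestrator's templated inputs, and the base URL (the zoomcamp's GitHub-hosted taxi data) is an assumption you should replace with your actual source.

```shell
#!/bin/sh
# Sketch of the parameterized extraction step.
# Positional args stand in for orchestrator inputs; the base URL is an
# assumption and should be replaced with your real data source.
data_type="${1:-yellow}"   # e.g. 'yellow' or 'green'
year="${2:-2019}"
month="${3:-01}"

file="${data_type}_tripdata_${year}-${month}.csv"
url="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/${data_type}/${file}.gz"

# Guarded so the sketch can run without network access; set DO_DOWNLOAD=1
# to perform the actual download-and-decompress pipeline.
if [ "${DO_DOWNLOAD:-0}" = "1" ]; then
    wget --quiet --output-document=- "${url}" | gunzip > "${file}"
fi

echo "output file: ${file}"
```

In the orchestrator itself, `data_type`, `year`, and `month` would be flow inputs rendered into the command via template expressions rather than shell arguments.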
Usage
Use shell-based extraction in orchestrated workflows when:
- Source data is available as compressed flat files (CSV, JSON) on HTTP endpoints.
- The download-and-decompress operation can be expressed as a single shell pipeline.
- Input parameters (dataset type, date partitions) should be selectable by the user at execution time.
- Output files must be tracked by the orchestrator for use in subsequent tasks.
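The "single shell pipeline" criterion can be sanity-checked locally without any network access by simulating the remote compressed file with gzip. The filenames here are illustrative:

```shell
#!/bin/sh
# Simulate the remote .csv.gz with gzip, then stream-decompress it
# through a pipe, mirroring the wget | gunzip pattern.
printf 'id,value\n1,a\n2,b\n' > sample.csv
gzip --keep --force sample.csv             # produces sample.csv.gz
cat sample.csv.gz | gunzip > restored.csv  # stand-in for wget -O- | gunzip
cmp -s sample.csv restored.csv && echo "round-trip OK"
```

If the round-trip succeeds here, the same pipeline shape will work once `cat` is replaced by the real download command.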
Theoretical Basis
The extraction pattern follows this logic:
INPUTS:
  data_type = user_selected_value   -- e.g., 'yellow', 'green'
  year      = user_selected_value   -- e.g., '2019', '2020'
  month     = user_selected_value   -- e.g., '01' through '12'

CONSTRUCT url FROM template:
  url = base_url / {data_type} / {data_type}_tripdata_{year}-{month}.csv.gz

CONSTRUCT output filename:
  file = {data_type}_tripdata_{year}-{month}.csv

EXECUTE shell command:
  wget --quiet --output-document=- {url} | gunzip > {file}

REGISTER output files matching pattern "*.csv":
  -> file is stored in orchestrator internal storage
  -> downstream tasks reference file via output variable
The piped command (wget | gunzip) streams the compressed data directly through decompression without writing the intermediate .gz file to disk, reducing both I/O and storage requirements.
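One caveat with the piped form: in a POSIX shell, a pipeline's exit status is that of its last command, so a failed download can be masked if the downstream command still exits zero. In bash, `set -o pipefail` propagates the upstream failure so the orchestrator's exit-code capture sees it. A minimal sketch, using `false` and `true` as stand-ins for a failing wget and a tolerant consumer:

```shell
#!/bin/bash
# Without pipefail, the pipeline's status comes from the LAST command.
false | true                 # 'false' stands in for a failed wget
status_default=$?            # 0: the upstream failure is masked

set -o pipefail
false | true
status_pipefail=$?           # nonzero: the failure now propagates

echo "default=${status_default} pipefail=${status_pipefail}"
```

Enabling pipefail in the task's shell preamble is a cheap safeguard when the orchestrator relies on exit codes to trigger retries.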