Principle: DataTalksClub Data Engineering Zoomcamp - Kestra Data Extraction
| Metadata | |
|---|---|
| Knowledge Sources | Kestra Tasks Documentation, Kestra Inputs Documentation, GNU Wget Manual |
| Domains | Data Extraction, ETL, Shell Scripting, Workflow Orchestration |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Data extraction from remote sources can be implemented with shell commands inside orchestrated workflows: files are downloaded and decompressed as managed tasks whose outputs are captured by the orchestrator.
Description
The extraction phase of an ETL pipeline retrieves raw data from external sources and prepares it for downstream processing. When data is hosted as compressed flat files on remote servers, shell-based extraction provides a direct and efficient mechanism: a shell command downloads the file over HTTP and pipes the compressed stream through a decompression utility, producing a ready-to-process file in a single operation.
Within an orchestration platform, this shell command runs as a managed task rather than an ad-hoc script. The orchestrator:
- Parameterizes the command -- input variables such as data type, year, and month are injected into the URL and filename using template expressions, making the extraction step reusable across different datasets.
- Captures output files -- the orchestrator registers the produced files (matched via glob patterns like *.csv) as task outputs, making them available to subsequent pipeline steps through internal storage references.
- Manages execution context -- the task runner controls the process lifecycle, captures exit codes, and integrates with the orchestrator's logging and retry mechanisms.
This approach avoids the need for custom extraction code or dedicated connector plugins when the data source provides straightforward HTTP file downloads.
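The parameterized extraction described above can be sketched as a standalone shell script. This is a hedged illustration, not the orchestrator's actual task definition: positional arguments stand in for the orchestrator's templated inputs, and the base URL (the zoomcamp's GitHub-hosted taxi data) is an assumption you should replace with your actual source.

```shell
#!/bin/sh
# Sketch of the parameterized extraction step.
# Positional args stand in for orchestrator inputs; the base URL is an
# assumption and should be replaced with your real data source.
data_type="${1:-yellow}"   # e.g. 'yellow' or 'green'
year="${2:-2019}"
month="${3:-01}"

file="${data_type}_tripdata_${year}-${month}.csv"
url="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/${data_type}/${file}.gz"

# Guarded so the sketch can run without network access; set DO_DOWNLOAD=1
# to perform the actual download-and-decompress pipeline.
if [ "${DO_DOWNLOAD:-0}" = "1" ]; then
    wget --quiet --output-document=- "${url}" | gunzip > "${file}"
fi

echo "output file: ${file}"
```

In the orchestrator itself, `data_type`, `year`, and `month` would be flow inputs rendered into the command via template expressions rather than shell arguments.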
Usage
Use shell-based extraction in orchestrated workflows when:
- Source data is available as compressed flat files (CSV, JSON) on HTTP endpoints.
- The download-and-decompress operation can be expressed as a single shell pipeline.
- Input parameters (dataset type, date partitions) should be selectable by the user at execution time.
- Output files must be tracked by the orchestrator for use in subsequent tasks.
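The "single shell pipeline" criterion can be sanity-checked locally without any network access by simulating the remote compressed file with gzip. The filenames here are illustrative:

```shell
#!/bin/sh
# Simulate the remote .csv.gz with gzip, then stream-decompress it
# through a pipe, mirroring the wget | gunzip pattern.
printf 'id,value\n1,a\n2,b\n' > sample.csv
gzip --keep --force sample.csv             # produces sample.csv.gz
cat sample.csv.gz | gunzip > restored.csv  # stand-in for wget -O- | gunzip
cmp -s sample.csv restored.csv && echo "round-trip OK"
```

If the round-trip succeeds here, the same pipeline shape will work once `cat` is replaced by the real download command.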
Theoretical Basis
The extraction pattern follows this logic:
INPUTS:
  data_type = user_selected_value   -- e.g., 'yellow', 'green'
  year      = user_selected_value   -- e.g., '2019', '2020'
  month     = user_selected_value   -- e.g., '01' through '12'

CONSTRUCT url FROM template:
  url = base_url / {data_type} / {data_type}_tripdata_{year}-{month}.csv.gz

CONSTRUCT output filename:
  file = {data_type}_tripdata_{year}-{month}.csv

EXECUTE shell command:
  wget --quiet --output-document=- {url} | gunzip > {file}

REGISTER output files matching pattern "*.csv":
  -> file is stored in orchestrator internal storage
  -> downstream tasks reference file via output variable
The piped command (wget | gunzip) streams the compressed data directly through decompression without writing the intermediate .gz file to disk, reducing both I/O and storage requirements.
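One caveat with the piped form: in a POSIX shell, a pipeline's exit status is that of its last command, so a failed download can be masked if the downstream command still exits zero. In bash, `set -o pipefail` propagates the upstream failure so the orchestrator's exit-code capture sees it. A minimal sketch, using `false` and `true` as stand-ins for a failing wget and a tolerant consumer:

```shell
#!/bin/bash
# Without pipefail, the pipeline's status comes from the LAST command.
false | true                 # 'false' stands in for a failed wget
status_default=$?            # 0: the upstream failure is masked

set -o pipefail
false | true
status_pipefail=$?           # nonzero: the failure now propagates

echo "default=${status_default} pipefail=${status_pipefail}"
```

Enabling pipefail in the task's shell preamble is a cheap safeguard when the orchestrator relies on exit codes to trigger retries.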