Principle: DataTalksClub Data Engineering Zoomcamp, Kestra Staging Load
| Metadata | |
|---|---|
| Knowledge Sources | PostgreSQL COPY Documentation, Kestra PostgreSQL Plugin Guide, Bulk Data Loading (Wikipedia) |
| Domains | Data Loading, ETL, Database Operations, Performance Optimization |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Bulk data loading uses the database-native COPY protocol for high-throughput transfer of flat file data into staging tables, bypassing the overhead of row-by-row INSERT operations.
Description
When loading large volumes of data from flat files into a relational database, the standard SQL INSERT statement processes one row at a time, incurring significant overhead from parsing, planning, and transaction management per row. The COPY protocol is a database-native mechanism that streams data directly into a table in a binary or text format, achieving dramatically higher throughput.
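The per-row overhead described above can be illustrated without a PostgreSQL server. The sketch below uses SQLite (which has no COPY protocol) purely as a stand-in: the row-by-row loop mirrors individual INSERT statements, while `executemany` reuses one prepared statement for the whole batch, analogous in spirit to COPY's single streaming operation. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, fare REAL)")
rows = [(i, i * 0.5) for i in range(10_000)]

# Row-by-row: each INSERT is issued (and in a real client/server
# setup, parsed and planned) separately -- the slow path.
for r in rows[:1_000]:
    conn.execute("INSERT INTO trips VALUES (?, ?)", r)

# Batched: one prepared statement reused for every row, avoiding
# the per-row round trips -- the idea COPY takes to its extreme
# by streaming raw data past per-statement processing entirely.
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows[1_000:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
```

In a real pipeline the gap is far larger than this analogy suggests, because each network INSERT also pays a client/server round trip that COPY's streaming avoids.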
The bulk loading pattern in an ETL pipeline operates as follows:
- Pre-condition: clean staging table -- before loading, the staging table is truncated to remove any data from previous executions. This ensures that each pipeline run starts with an empty staging area.
- COPY operation -- the file is streamed from the orchestrator's internal storage into the database using the COPY protocol. The operation specifies the file format (CSV), whether a header row is present, the target table, and the explicit column list to map file columns to table columns.
- Column mapping -- an explicit column list in the COPY command ensures that file columns are mapped to the correct table columns by name rather than by ordinal position. This is critical when the table contains additional columns (such as unique_row_id and filename) that are not present in the source file.
The staging table serves as a temporary landing zone where data quality operations (deduplication, lineage tagging) can be performed before the data is promoted to the production table.
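The three steps above can be sketched as a Kestra flow fragment using the Kestra PostgreSQL plugin's `Queries` and `CopyIn` task types. The connection URL, table name, column names, and the upstream `extract` task are all illustrative placeholders, not a definitive configuration:

```yaml
id: postgres_staging_load
namespace: example

tasks:
  # Pre-condition: clean staging table before every run.
  - id: truncate_staging
    type: io.kestra.plugin.jdbc.postgresql.Queries
    url: jdbc:postgresql://localhost:5432/taxi
    username: "{{ secret('PG_USER') }}"
    password: "{{ secret('PG_PASSWORD') }}"
    sql: TRUNCATE TABLE public.yellow_tripdata_staging;

  # COPY operation with explicit column mapping; infrastructure
  # columns (unique_row_id, filename) are omitted and stay NULL.
  - id: copy_to_staging
    type: io.kestra.plugin.jdbc.postgresql.CopyIn
    url: jdbc:postgresql://localhost:5432/taxi
    username: "{{ secret('PG_USER') }}"
    password: "{{ secret('PG_PASSWORD') }}"
    format: CSV
    header: true
    table: public.yellow_tripdata_staging
    columns:
      - vendor_id
      - pickup_datetime
      - fare_amount
    from: "{{ outputs.extract.outputFiles['taxi.csv'] }}"
```

The `from` expression references a file in Kestra's internal storage produced by a hypothetical upstream extraction task, matching the pattern described above.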
Usage
Use bulk data loading via the COPY protocol when:
- Loading CSV or other delimited flat files into PostgreSQL or compatible databases.
- The dataset is large enough that row-by-row INSERT would be prohibitively slow.
- The target table has additional columns not present in the source file, requiring explicit column mapping.
- A staging-first loading pattern is used to enable data quality checks before production merge.
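A minimal sketch of composing such a COPY statement with an explicit column list, assuming illustrative table and column names. Plain string formatting is used here to keep the example self-contained; production code should quote identifiers safely (e.g. with psycopg2's `sql.Identifier`):

```python
def build_copy_sql(table: str, columns: list[str]) -> str:
    """Compose a COPY ... FROM STDIN statement with an explicit
    column list, so file columns map to table columns by name
    rather than by ordinal position."""
    col_list = ", ".join(columns)
    return (
        f"COPY {table} ({col_list}) "
        f"FROM STDIN WITH (FORMAT CSV, HEADER true)"
    )

sql = build_copy_sql(
    "yellow_tripdata_staging",
    ["vendor_id", "pickup_datetime", "fare_amount"],
)
# With psycopg2, the statement would then be executed against an
# open file, e.g.: cursor.copy_expert(sql, open("taxi.csv"))
```

Because the column list excludes unique_row_id and filename, those columns are left NULL by the load and can be populated in subsequent steps.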
Theoretical Basis
```text
PRE-CONDITION:
    staging_table exists with schema matching source file + infrastructure columns
    staging_table has been TRUNCATED (empty)

INPUT:
    csv_file     = file from orchestrator internal storage (output of extraction step)
    target_table = staging_table name
    columns      = [col1, col2, ..., colN]
                   -- source file columns only (excludes unique_row_id, filename)

EXECUTE COPY:
    COPY {target_table} ({columns})
    FROM STDIN
    WITH (FORMAT CSV, HEADER true)
    -- stream csv_file contents to STDIN

POST-CONDITION:
    staging_table contains all rows from csv_file
    unique_row_id and filename columns are NULL (populated in subsequent steps)
    row count in staging_table == row count in csv_file (minus header)
```
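The post-conditions can be verified with a simple row-count check. The sketch below simulates the loaded state with SQLite and plain INSERTs as a runnable stand-in for a COPY into PostgreSQL; the staging schema and CSV contents are illustrative:

```python
import csv
import io
import sqlite3

# Stand-in for the source file in orchestrator internal storage.
csv_file = io.StringIO(
    "vendor_id,fare_amount\n"
    "1,12.5\n"
    "2,8.0\n"
)

# Staging table: source columns plus infrastructure columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging ("
    "vendor_id TEXT, fare_amount TEXT, "
    "unique_row_id TEXT, filename TEXT)"
)

reader = csv.reader(csv_file)
next(reader)  # skip header row, as COPY does with HEADER true
loaded = 0
for row in reader:
    # Only source columns are written; infrastructure columns
    # stay NULL, matching the post-condition.
    conn.execute(
        "INSERT INTO staging (vendor_id, fare_amount) VALUES (?, ?)",
        row,
    )
    loaded += 1

table_count = conn.execute(
    "SELECT COUNT(*) FROM staging").fetchone()[0]
null_ids = conn.execute(
    "SELECT COUNT(*) FROM staging WHERE unique_row_id IS NULL"
).fetchone()[0]
```

A production pipeline would run the same counts against the real staging table and fail the flow if they diverge from the file's row count.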
The COPY protocol parses the command once and then streams data rows without per-row SQL parsing, planning, or statement round trips. Eliminating this per-row overhead can make COPY 10-100x faster than equivalent INSERT statements for large datasets.