Principle: DataTalksClub Data Engineering Zoomcamp, Kestra Staging Load
| Metadata | |
|---|---|
| Knowledge Sources | PostgreSQL COPY Documentation, Kestra PostgreSQL Plugin Guide, Bulk Data Loading (Wikipedia) |
| Domains | Data Loading, ETL, Database Operations, Performance Optimization |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Bulk data loading uses the database-native COPY protocol for high-throughput transfer of flat file data into staging tables, bypassing the overhead of row-by-row INSERT operations.
Description
When loading large volumes of data from flat files into a relational database, the standard SQL INSERT statement processes one row at a time, incurring significant overhead from parsing, planning, and transaction management per row. The COPY protocol is a database-native mechanism that streams data directly into a table in a binary or text format, achieving dramatically higher throughput.
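The per-row overhead described above can be illustrated without a PostgreSQL server. The sketch below uses SQLite (which has no COPY protocol) purely as a stand-in: the row-by-row loop mirrors individual INSERT statements, while `executemany` reuses one prepared statement for the whole batch, analogous in spirit to COPY's single streaming operation. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, fare REAL)")
rows = [(i, i * 0.5) for i in range(10_000)]

# Row-by-row: each INSERT is issued (and in a real client/server
# setup, parsed and planned) separately -- the slow path.
for r in rows[:1_000]:
    conn.execute("INSERT INTO trips VALUES (?, ?)", r)

# Batched: one prepared statement reused for every row, avoiding
# the per-row round trips -- the idea COPY takes to its extreme
# by streaming raw data past per-statement processing entirely.
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows[1_000:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
```

In a real pipeline the gap is far larger than this analogy suggests, because each network INSERT also pays a client/server round trip that COPY's streaming avoids.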
The bulk loading pattern in an ETL pipeline operates as follows:
- Pre-condition: clean staging table -- before loading, the staging table is truncated to remove any data from previous executions. This ensures that each pipeline run starts with an empty staging area.
- COPY operation -- the file is streamed from the orchestrator's internal storage into the database using the COPY protocol. The operation specifies the file format (CSV), whether a header row is present, the target table, and the explicit column list to map file columns to table columns.
- Column mapping -- an explicit column list in the COPY command ensures that file columns are mapped to the correct table columns by name rather than by ordinal position. This is critical when the table contains additional columns (such as unique_row_id and filename) that are not present in the source file.
The staging table serves as a temporary landing zone where data quality operations (deduplication, lineage tagging) can be performed before the data is promoted to the production table.
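The three steps above can be sketched as a Kestra flow fragment using the Kestra PostgreSQL plugin's `Queries` and `CopyIn` task types. The connection URL, table name, column names, and the upstream `extract` task are all illustrative placeholders, not a definitive configuration:

```yaml
id: postgres_staging_load
namespace: example

tasks:
  # Pre-condition: clean staging table before every run.
  - id: truncate_staging
    type: io.kestra.plugin.jdbc.postgresql.Queries
    url: jdbc:postgresql://localhost:5432/taxi
    username: "{{ secret('PG_USER') }}"
    password: "{{ secret('PG_PASSWORD') }}"
    sql: TRUNCATE TABLE public.yellow_tripdata_staging;

  # COPY operation with explicit column mapping; infrastructure
  # columns (unique_row_id, filename) are omitted and stay NULL.
  - id: copy_to_staging
    type: io.kestra.plugin.jdbc.postgresql.CopyIn
    url: jdbc:postgresql://localhost:5432/taxi
    username: "{{ secret('PG_USER') }}"
    password: "{{ secret('PG_PASSWORD') }}"
    format: CSV
    header: true
    table: public.yellow_tripdata_staging
    columns:
      - vendor_id
      - pickup_datetime
      - fare_amount
    from: "{{ outputs.extract.outputFiles['taxi.csv'] }}"
```

The `from` expression references a file in Kestra's internal storage produced by a hypothetical upstream extraction task, matching the pattern described above.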
Usage
Use bulk data loading via the COPY protocol when:
- Loading CSV or other delimited flat files into PostgreSQL or compatible databases.
- The dataset is large enough that row-by-row INSERT would be prohibitively slow.
- The target table has additional columns not present in the source file, requiring explicit column mapping.
- A staging-first loading pattern is used to enable data quality checks before production merge.
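A minimal sketch of composing such a COPY statement with an explicit column list, assuming illustrative table and column names. Plain string formatting is used here to keep the example self-contained; production code should quote identifiers safely (e.g. with psycopg2's `sql.Identifier`):

```python
def build_copy_sql(table: str, columns: list[str]) -> str:
    """Compose a COPY ... FROM STDIN statement with an explicit
    column list, so file columns map to table columns by name
    rather than by ordinal position."""
    col_list = ", ".join(columns)
    return (
        f"COPY {table} ({col_list}) "
        f"FROM STDIN WITH (FORMAT CSV, HEADER true)"
    )

sql = build_copy_sql(
    "yellow_tripdata_staging",
    ["vendor_id", "pickup_datetime", "fare_amount"],
)
# With psycopg2, the statement would then be executed against an
# open file, e.g.: cursor.copy_expert(sql, open("taxi.csv"))
```

Because the column list excludes unique_row_id and filename, those columns are left NULL by the load and can be populated in subsequent steps.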
Theoretical Basis
```text
PRE-CONDITION:
    staging_table exists with schema matching source file + infrastructure columns
    staging_table has been TRUNCATED (empty)

INPUT:
    csv_file     = file from orchestrator internal storage (output of extraction step)
    target_table = staging_table name
    columns      = [col1, col2, ..., colN]
                   -- source file columns only (excludes unique_row_id, filename)

EXECUTE COPY:
    COPY {target_table} ({columns})
    FROM STDIN
    WITH (FORMAT CSV, HEADER true)
    -- stream csv_file contents to STDIN

POST-CONDITION:
    staging_table contains all rows from csv_file
    unique_row_id and filename columns are NULL (populated in subsequent steps)
    row count in staging_table == row count in csv_file (minus header)
```
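The post-conditions can be verified with a simple row-count check. The sketch below simulates the loaded state with SQLite and plain INSERTs as a runnable stand-in for a COPY into PostgreSQL; the staging schema and CSV contents are illustrative:

```python
import csv
import io
import sqlite3

# Stand-in for the source file in orchestrator internal storage.
csv_file = io.StringIO(
    "vendor_id,fare_amount\n"
    "1,12.5\n"
    "2,8.0\n"
)

# Staging table: source columns plus infrastructure columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging ("
    "vendor_id TEXT, fare_amount TEXT, "
    "unique_row_id TEXT, filename TEXT)"
)

reader = csv.reader(csv_file)
next(reader)  # skip header row, as COPY does with HEADER true
loaded = 0
for row in reader:
    # Only source columns are written; infrastructure columns
    # stay NULL, matching the post-condition.
    conn.execute(
        "INSERT INTO staging (vendor_id, fare_amount) VALUES (?, ?)",
        row,
    )
    loaded += 1

table_count = conn.execute(
    "SELECT COUNT(*) FROM staging").fetchone()[0]
null_ids = conn.execute(
    "SELECT COUNT(*) FROM staging WHERE unique_row_id IS NULL"
).fetchone()[0]
```

A production pipeline would run the same counts against the real staging table and fail the flow if they diverge from the file's row count.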
The COPY protocol parses the command once and then streams data rows without per-row SQL parsing, planning, or statement round trips. Eliminating this per-row overhead can make COPY 10-100x faster than equivalent INSERT statements for large datasets.