Principle:DataTalksClub Data engineering zoomcamp Merge Upsert
| Metadata | |
|---|---|
| Knowledge Sources | PostgreSQL MERGE Documentation, SQL MERGE Standard (Wikipedia), Kestra ETL Pipeline Blog |
| Domains | Data Loading, ETL, Idempotency, SQL |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Idempotent data loading uses the SQL MERGE (upsert) pattern to ensure that only new records are inserted into the production table, making pipeline re-runs safe and preventing duplicate data.
Description
The final stage of a staging-based ETL pipeline must transfer validated data from the staging table to the production table. A naive INSERT would create duplicates if the pipeline is re-run against the same source data. The MERGE statement (also known as upsert) solves this by combining the semantics of INSERT, UPDATE, and DELETE into a single atomic operation that compares source and target rows.
The SQL MERGE pattern operates as follows:
- MERGE INTO -- specifies the target (production) table.
- USING -- specifies the source (staging) table containing the new data.
- ON -- defines the match condition, typically comparing the unique hash key between source and target rows.
- WHEN NOT MATCHED THEN INSERT -- for rows in the staging table that have no corresponding match in the production table, insert them as new records.
- WHEN MATCHED THEN UPDATE (optional) -- for rows that exist in both tables, update the target row with new values. This clause is omitted when append-only semantics are desired.
The key benefit is idempotency: running the same pipeline multiple times with the same input data will produce the same result in the production table. The first run inserts all rows; subsequent runs find all rows already matched and insert nothing.
Usage
Use MERGE/upsert-based loading when:
- Pipeline re-runs must be safe and not create duplicate records.
- A staging-to-production promotion pattern is used with a deterministic unique key.
- Append-only semantics are desired (new rows are added, existing rows are not modified).
- The database supports the SQL MERGE statement (PostgreSQL 15+, SQL Server, Oracle, etc.).
Theoretical Basis
PRE-CONDITIONS:
production_table contains previously loaded data (or is empty on first run)
staging_table contains new data with unique_row_id populated
MERGE operation:
FOR EACH row S IN staging_table:
SEARCH production_table FOR row T WHERE T.unique_row_id = S.unique_row_id
IF no match found (T does not exist):
INSERT S into production_table
ELSE IF match found (T exists):
SKIP (do nothing -- row already loaded)
POST-CONDITIONS:
production_table contains the UNION of:
- all previously existing rows (unchanged)
- all new rows from staging_table that were not previously present
No duplicates exist (guaranteed by unique_row_id match)
IDEMPOTENCY PROOF:
Run 1: staging has rows {A, B, C}, production is empty
-> inserts {A, B, C}
Run 2: staging has rows {A, B, C} again
-> all match existing rows, 0 inserts
Result: production has exactly {A, B, C} after both runs
Note on PostgreSQL version: The MERGE statement was introduced in PostgreSQL 15 (released October 2022) as part of the SQL:2003 standard. Earlier PostgreSQL versions require the INSERT ... ON CONFLICT syntax as an alternative upsert mechanism, but this requires a unique constraint or index on the match column.