Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:DataTalksClub Data engineering zoomcamp Merge Upsert

From Leeroopedia
Revision as of 18:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/DataTalksClub_Data_engineering_zoomcamp_Merge_Upsert.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Metadata
Knowledge Sources PostgreSQL MERGE Documentation, SQL MERGE Standard (Wikipedia), Kestra ETL Pipeline Blog
Domains Data Loading, ETL, Idempotency, SQL
Last Updated 2026-02-09 14:00 GMT

Overview

Idempotent data loading uses the SQL MERGE (upsert) pattern to ensure that only new records are inserted into the production table, making pipeline re-runs safe and preventing duplicate data.

Description

The final stage of a staging-based ETL pipeline must transfer validated data from the staging table to the production table. A naive INSERT would create duplicates if the pipeline is re-run against the same source data. The MERGE statement (also known as upsert) solves this by combining the semantics of INSERT, UPDATE, and DELETE into a single atomic operation that compares source and target rows.

The SQL MERGE pattern operates as follows:

  • MERGE INTO -- specifies the target (production) table.
  • USING -- specifies the source (staging) table containing the new data.
  • ON -- defines the match condition, typically comparing the unique hash key between source and target rows.
  • WHEN NOT MATCHED THEN INSERT -- for rows in the staging table that have no corresponding match in the production table, insert them as new records.
  • WHEN MATCHED THEN UPDATE (optional) -- for rows that exist in both tables, update the target row with new values. This clause is omitted when append-only semantics are desired.

The key benefit is idempotency: running the same pipeline multiple times with the same input data will produce the same result in the production table. The first run inserts all rows; subsequent runs find all rows already matched and insert nothing.

Usage

Use MERGE/upsert-based loading when:

  • Pipeline re-runs must be safe and not create duplicate records.
  • A staging-to-production promotion pattern is used with a deterministic unique key.
  • Append-only semantics are desired (new rows are added, existing rows are not modified).
  • The database supports the SQL MERGE statement (PostgreSQL 15+, SQL Server, Oracle, etc.).

Theoretical Basis

PRE-CONDITIONS:
  production_table contains previously loaded data (or is empty on first run)
  staging_table contains new data with unique_row_id populated

MERGE operation:
  FOR EACH row S IN staging_table:
      SEARCH production_table FOR row T WHERE T.unique_row_id = S.unique_row_id

      IF no match found (T does not exist):
          INSERT S into production_table
      ELSE IF match found (T exists):
          SKIP (do nothing -- row already loaded)

POST-CONDITIONS:
  production_table contains the UNION of:
    - all previously existing rows (unchanged)
    - all new rows from staging_table that were not previously present
  No duplicates exist (guaranteed by unique_row_id match)

IDEMPOTENCY PROOF:
  Run 1: staging has rows {A, B, C}, production is empty
         -> inserts {A, B, C}
  Run 2: staging has rows {A, B, C} again
         -> all match existing rows, 0 inserts
  Result: production has exactly {A, B, C} after both runs

Note on PostgreSQL version: The MERGE statement was introduced in PostgreSQL 15 (released October 2022) as part of the SQL:2003 standard. Earlier PostgreSQL versions require the INSERT ... ON CONFLICT syntax as an alternative upsert mechanism, but this requires a unique constraint or index on the match column.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment