Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:DataTalksClub Data engineering zoomcamp Merge Upsert

From Leeroopedia


Metadata
Knowledge Sources PostgreSQL MERGE Documentation, SQL MERGE Standard (Wikipedia), Kestra ETL Pipeline Blog
Domains Data Loading, ETL, Idempotency, SQL
Last Updated 2026-02-09 14:00 GMT

Overview

Idempotent data loading uses the SQL MERGE (upsert) pattern to ensure that only new records are inserted into the production table, making pipeline re-runs safe and preventing duplicate data.

Description

The final stage of a staging-based ETL pipeline must transfer validated data from the staging table to the production table. A naive INSERT would create duplicates if the pipeline is re-run against the same source data. The MERGE statement (also known as upsert) solves this by combining the semantics of INSERT, UPDATE, and DELETE into a single atomic operation that compares source and target rows.

The SQL MERGE pattern operates as follows:

  • MERGE INTO -- specifies the target (production) table.
  • USING -- specifies the source (staging) table containing the new data.
  • ON -- defines the match condition, typically comparing the unique hash key between source and target rows.
  • WHEN NOT MATCHED THEN INSERT -- for rows in the staging table that have no corresponding match in the production table, insert them as new records.
  • WHEN MATCHED THEN UPDATE (optional) -- for rows that exist in both tables, update the target row with new values. This clause is omitted when append-only semantics are desired.

The key benefit is idempotency: running the same pipeline multiple times with the same input data will produce the same result in the production table. The first run inserts all rows; subsequent runs find all rows already matched and insert nothing.

Usage

Use MERGE/upsert-based loading when:

  • Pipeline re-runs must be safe and not create duplicate records.
  • A staging-to-production promotion pattern is used with a deterministic unique key.
  • Append-only semantics are desired (new rows are added, existing rows are not modified).
  • The database supports the SQL MERGE statement (PostgreSQL 15+, SQL Server, Oracle, etc.).

Theoretical Basis

PRE-CONDITIONS:
  production_table contains previously loaded data (or is empty on first run)
  staging_table contains new data with unique_row_id populated

MERGE operation:
  FOR EACH row S IN staging_table:
      SEARCH production_table FOR row T WHERE T.unique_row_id = S.unique_row_id

      IF no match found (T does not exist):
          INSERT S into production_table
      ELSE IF match found (T exists):
          SKIP (do nothing -- row already loaded)

POST-CONDITIONS:
  production_table contains the UNION of:
    - all previously existing rows (unchanged)
    - all new rows from staging_table that were not previously present
  No duplicates exist (guaranteed by unique_row_id match)

IDEMPOTENCY PROOF:
  Run 1: staging has rows {A, B, C}, production is empty
         -> inserts {A, B, C}
  Run 2: staging has rows {A, B, C} again
         -> all match existing rows, 0 inserts
  Result: production has exactly {A, B, C} after both runs

Note on PostgreSQL version: The MERGE statement was introduced in PostgreSQL 15 (released October 2022) as part of the SQL:2003 standard. Earlier PostgreSQL versions require the INSERT ... ON CONFLICT syntax as an alternative upsert mechanism, but this requires a unique constraint or index on the match column.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment