# Principle: Data Deduplication (DataTalksClub Data Engineering Zoomcamp)
| Metadata | |
|---|---|
| Knowledge Sources | MD5 Hash Function (Wikipedia), PostgreSQL String Functions, Data Deduplication (Wikipedia) |
| Domains | Data Quality, ETL, Deduplication, Data Integrity |
| Last Updated | 2026-02-09 14:00 GMT |
## Overview
Data deduplication uses deterministic hash keys computed from business key columns to create unique row identifiers for detecting and preventing duplicate records across pipeline executions.
## Description
In data pipelines that may process overlapping datasets or be re-executed against the same source data, deduplication is essential to maintain data integrity. The deterministic hash key approach works by selecting a set of columns that together form a natural business key -- a combination of fields that uniquely identifies a real-world event or entity -- and computing a fixed-length hash from their concatenated values.
The key properties of this approach are:
- Deterministic -- the same input values always produce the same hash, so the same record processed in different pipeline runs will generate an identical key. This is the foundation for detecting duplicates.
- Fixed-length output -- regardless of input column sizes, the hash produces a constant-length string (32 hex characters for MD5), making it efficient for indexing and comparison.
- NULL handling -- the `COALESCE` function converts NULL values to empty strings before hashing. Without this, any row with a NULL in a key column would produce a NULL hash, because string concatenation with NULL yields NULL in SQL.
- Type casting -- non-text columns (timestamps, numbers) are explicitly cast to text before concatenation, ensuring a consistent string representation.
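The properties above can be sketched in a few lines of Python. This is an illustration of the hashing pattern, not a byte-for-byte replica of PostgreSQL's behavior -- in particular, Python's `str()` may format numbers and timestamps differently than SQL's `CAST(... AS text)`, so keys must always be computed on the same side (database or application), never mixed.

```python
import hashlib

def unique_row_id(*values):
    """Emulate MD5(COALESCE(CAST(col AS text), '') || ...) from the SQL pattern.

    None (NULL) becomes '' in the COALESCE step, and every value is cast to
    text before concatenation, so the resulting key is deterministic.
    """
    concatenated = "".join("" if v is None else str(v) for v in values)
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

# Same inputs -> same 32-hex-character key, even across pipeline runs.
a = unique_row_id(1, "2024-01-01 08:30:00", "2024-01-01 08:45:00", 161, 237)
b = unique_row_id(1, "2024-01-01 08:30:00", "2024-01-01 08:45:00", 161, 237)
assert a == b and len(a) == 32

# A NULL in a key column no longer poisons the whole hash -- it hashes as ''.
c = unique_row_id(1, None, "2024-01-01 08:45:00", 161, 237)
assert len(c) == 32
```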
In addition to the deduplication hash, a lineage tracking field records the source filename, enabling traceability from any row in the production table back to the specific file it originated from.
## Usage
Use deterministic hash key deduplication when:
- Pipeline re-runs may process previously loaded data, and duplicates must be prevented.
- The source data does not contain a natural unique identifier or primary key.
- A composite business key can be formed from multiple columns that together identify a unique record.
- Data lineage tracking (source file provenance) is required alongside deduplication.
## Theoretical Basis
```
GIVEN staging_table with loaded data and NULL unique_row_id values

SELECT business key columns:
    key_columns = [VendorID, pickup_datetime, dropoff_datetime,
                   PULocationID, DOLocationID, fare_amount, trip_distance]

FOR EACH row IN staging_table:
    concatenated = ""
    FOR EACH col IN key_columns:
        value = COALESCE(CAST(row[col] AS text), '')
        concatenated = concatenated || value
    row.unique_row_id = MD5(concatenated)
    row.filename = source_file_name
```

Executed as a single SQL UPDATE:

```sql
UPDATE staging_table
SET unique_row_id = MD5(
        COALESCE(CAST(col1 AS text), '') ||
        COALESCE(CAST(col2 AS text), '') ||
        ... ||
        COALESCE(CAST(colN AS text), '')
    ),
    filename = '{source_file}'
```
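To show how the key is then used to prevent duplicates, here is a minimal in-memory sketch of an insert-if-absent load step. The names (`row_key`, `load_without_duplicates`, the dict-based `production` table) are hypothetical illustrations, not part of the original pipeline; in practice the same check is done in SQL with a join or `MERGE` on `unique_row_id`.

```python
import hashlib

def row_key(row, key_columns):
    # Deterministic key: COALESCE(CAST(... AS text), '') per column, then MD5.
    concatenated = "".join(
        "" if row.get(col) is None else str(row[col]) for col in key_columns
    )
    return hashlib.md5(concatenated.encode("utf-8")).hexdigest()

def load_without_duplicates(production, staging_rows, key_columns, source_file):
    """Insert staging rows into `production` (a dict keyed by unique_row_id),
    skipping any row whose hash key is already present. Returns rows inserted."""
    inserted = 0
    for row in staging_rows:
        uid = row_key(row, key_columns)
        if uid in production:
            continue  # same business key already loaded by an earlier run
        production[uid] = {**row, "unique_row_id": uid, "filename": source_file}
        inserted += 1
    return inserted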
Collision risk: MD5 produces a 128-bit hash. By the birthday bound, a collision among n rows has probability of roughly n^2 / 2^129, and collisions only become likely once a table approaches 2^64 rows. For typical data pipeline volumes (millions to low billions of rows), this probability is negligible. MD5 is used here for speed and convenience, not for cryptographic security: an adversary can construct colliding inputs deliberately, but that threat is irrelevant when deduplicating trusted source data.
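The birthday-bound estimate above can be checked numerically. This uses the standard approximation P(collision) ≈ n^2 / 2^(b+1) for n rows and a b-bit hash; the function name is illustrative.

```python
def md5_collision_probability(n_rows: int, hash_bits: int = 128) -> float:
    """Birthday-bound approximation: P(collision) ≈ n^2 / 2^(hash_bits + 1)."""
    return n_rows ** 2 / 2 ** (hash_bits + 1)

# One billion rows against a 128-bit hash: probability on the order of 1e-21.
print(md5_collision_probability(10 ** 9))
```

At 2^64 rows the approximation reaches 0.5, which is why 2^64 is quoted as the point where collisions become likely.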
Choice of key columns: The 7 selected columns (vendor, pickup/dropoff times, pickup/dropoff locations, fare, distance) represent a trip's core identity. Two trips with identical values across all these fields are considered the same real-world event.