Principle: Schema Normalization (DataTalksClub Data Engineering Zoomcamp)
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub Data Engineering Zoomcamp |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Schema normalization is the process of harmonizing column names and structures across heterogeneous datasets so they can be combined into a unified schema.
Description
When data originates from multiple sources that represent the same domain but use different naming conventions, the column names must be aligned before the datasets can be merged. This is a common challenge in data engineering: two systems may capture the same concept (such as a pickup timestamp) but label it differently due to independent development, vendor differences, or historical naming decisions.
Schema normalization addresses this by renaming columns in each source dataset to conform to a common naming standard. The key considerations are:
- Semantic equivalence -- The renamed columns must represent the same concept. For example, a "green taxi pickup time" and a "yellow taxi pickup time" are semantically the same field despite having different prefixes.
- Non-destructive transformation -- Renaming produces a new dataset with updated column names; the original data is not modified.
- Prerequisite for union operations -- Combining datasets vertically (appending rows) requires that both datasets share identical column names and compatible types.
- Preserving data lineage -- Even after renaming, the transformation should be traceable. A reader of the pipeline should be able to understand that pickup_datetime originated from lpep_pickup_datetime in one source and tpep_pickup_datetime in another.
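These considerations can be illustrated with a short sketch. This is a minimal pandas example, assuming a hypothetical green-taxi frame with only the two timestamp columns; it shows that the rename is non-destructive (the original frame keeps its columns).

```python
import pandas as pd

# Hypothetical green-taxi sample; real trip data has many more columns.
green = pd.DataFrame({
    "lpep_pickup_datetime": ["2021-01-01 00:15:56"],
    "lpep_dropoff_datetime": ["2021-01-01 00:19:52"],
})

# rename() returns a NEW DataFrame with updated names;
# the original `green` frame is left unmodified.
green_normalized = green.rename(columns={
    "lpep_pickup_datetime": "pickup_datetime",
    "lpep_dropoff_datetime": "dropoff_datetime",
})

print(list(green_normalized.columns))  # ['pickup_datetime', 'dropoff_datetime']
print(list(green.columns))             # original names preserved
```

The same mapping idea applies to the yellow-taxi source with its tpep_ prefix.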
Usage
Use schema normalization when:
- Merging datasets from different sources that represent the same entity or event type
- Column names differ by prefix, suffix, or convention across sources
- Preparing data for a UNION operation that requires matching column names
- Standardizing naming conventions as part of a data quality or governance initiative
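The union prerequisite in particular is easy to demonstrate. A small pandas sketch, with illustrative column names of my own choosing: once two frames share identical column names, appending rows is a one-liner.

```python
import pandas as pd

# Two already-normalized sources with matching schemas.
a = pd.DataFrame({"pickup_time": ["2021-01-01 00:15"], "fare": [10.0]})
b = pd.DataFrame({"pickup_time": ["2021-01-01 00:31"], "fare": [12.5]})

# Vertical combination (the UNION analogue in pandas):
# columns align by name, so matching names are essential.
combined = pd.concat([a, b], ignore_index=True)

print(len(combined))  # 2
```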
Theoretical Basis
Schema normalization can be expressed as a mapping function applied to each dataset:
```
FUNCTION normalize_schema(dataset, rename_mapping):
    FOR EACH (old_name, new_name) IN rename_mapping:
        dataset = rename_column(dataset, old_name, new_name)
    RETURN dataset

rename_map_source_a = {
    "source_a_pickup_time" -> "pickup_time",
    "source_a_dropoff_time" -> "dropoff_time"
}

rename_map_source_b = {
    "source_b_pickup_time" -> "pickup_time",
    "source_b_dropoff_time" -> "dropoff_time"
}

dataset_a = normalize_schema(raw_dataset_a, rename_map_source_a)
dataset_b = normalize_schema(raw_dataset_b, rename_map_source_b)

-- After normalization, both datasets share the same column names
-- and can be combined via UNION
```
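The pseudocode above maps directly onto a concrete implementation. A sketch in pandas, using the same invented source_a/source_b column names as the mapping example:

```python
import pandas as pd

def normalize_schema(dataset, rename_mapping):
    # pandas applies the whole mapping at once and returns a new frame,
    # mirroring the loop of rename_column calls in the pseudocode.
    return dataset.rename(columns=rename_mapping)

raw_dataset_a = pd.DataFrame({"source_a_pickup_time": [1], "source_a_dropoff_time": [2]})
raw_dataset_b = pd.DataFrame({"source_b_pickup_time": [3], "source_b_dropoff_time": [4]})

dataset_a = normalize_schema(raw_dataset_a, {"source_a_pickup_time": "pickup_time",
                                             "source_a_dropoff_time": "dropoff_time"})
dataset_b = normalize_schema(raw_dataset_b, {"source_b_pickup_time": "pickup_time",
                                             "source_b_dropoff_time": "dropoff_time"})

# Both frames now expose the same schema and could be concatenated.
print(list(dataset_a.columns) == list(dataset_b.columns))  # True
```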
The normalization function is idempotent in the sense that applying it to an already-normalized dataset with the same mapping produces no change. It is also composable -- multiple rename operations can be chained together, each transforming the output of the previous one.
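The idempotence claim can be checked directly. In this pandas sketch, re-applying the same mapping is a no-op because, by default, rename keys that do not match any existing column are silently ignored:

```python
import pandas as pd

mapping = {"lpep_pickup_datetime": "pickup_datetime"}
df = pd.DataFrame({"lpep_pickup_datetime": ["2021-01-01 00:15:56"]})

once = df.rename(columns=mapping)
# Second application: "lpep_pickup_datetime" no longer exists,
# so the mapping matches nothing and the frame is unchanged.
twice = once.rename(columns=mapping)

print(once.equals(twice))  # True
```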
The critical design decision is determining the canonical name for each concept. In the taxi data domain, the convention is to strip the service-type prefix (e.g., lpep_ or tpep_) and use the generic form (pickup_datetime, dropoff_datetime), since the service type will be tracked separately via a discriminator column after unification.
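The discriminator-column step can be sketched as follows, again in pandas with minimal single-column frames standing in for the full green and yellow trip data:

```python
import pandas as pd

# Already-normalized frames (prefixes stripped from the timestamp columns).
green = pd.DataFrame({"pickup_datetime": ["2021-01-01 00:15:56"]})
yellow = pd.DataFrame({"pickup_datetime": ["2021-01-01 00:31:10"]})

# Track the service type explicitly, since the prefix that used to
# encode it (lpep_/tpep_) has been removed from the column names.
green["service_type"] = "green"
yellow["service_type"] = "yellow"

# Unified dataset: one schema, with origin preserved per row.
trips = pd.concat([green, yellow], ignore_index=True)

print(len(trips))  # 2
```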