Principle: Schema Normalization (DataTalksClub Data Engineering Zoomcamp)
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub Data Engineering Zoomcamp |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Schema normalization is the process of harmonizing column names and structures across heterogeneous datasets so they can be combined into a unified schema.
Description
When data originates from multiple sources that represent the same domain but use different naming conventions, the column names must be aligned before the datasets can be merged. This is a common challenge in data engineering: two systems may capture the same concept (such as a pickup timestamp) but label it differently due to independent development, vendor differences, or historical naming decisions.
Schema normalization addresses this by renaming columns in each source dataset to conform to a common naming standard. The key considerations are:
- Semantic equivalence -- The renamed columns must represent the same concept. For example, a "green taxi pickup time" and a "yellow taxi pickup time" are semantically the same field despite having different prefixes.
- Non-destructive transformation -- Renaming produces a new dataset with updated column names; the original data is not modified.
- Prerequisite for union operations -- Combining datasets vertically (appending rows) requires that both datasets share identical column names and compatible types.
- Preserving data lineage -- Even after renaming, the transformation should be traceable. A reader of the pipeline should be able to understand that pickup_datetime originated from lpep_pickup_datetime in one source and tpep_pickup_datetime in another.
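These considerations can be illustrated with a short sketch. This is a minimal pandas example, assuming a hypothetical green-taxi frame with only the two timestamp columns; it shows that the rename is non-destructive (the original frame keeps its columns).

```python
import pandas as pd

# Hypothetical green-taxi sample; real trip data has many more columns.
green = pd.DataFrame({
    "lpep_pickup_datetime": ["2021-01-01 00:15:56"],
    "lpep_dropoff_datetime": ["2021-01-01 00:19:52"],
})

# rename() returns a NEW DataFrame with updated names;
# the original `green` frame is left unmodified.
green_normalized = green.rename(columns={
    "lpep_pickup_datetime": "pickup_datetime",
    "lpep_dropoff_datetime": "dropoff_datetime",
})

print(list(green_normalized.columns))  # ['pickup_datetime', 'dropoff_datetime']
print(list(green.columns))             # original names preserved
```

The same mapping idea applies to the yellow-taxi source with its tpep_ prefix.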
Usage
Use schema normalization when:
- Merging datasets from different sources that represent the same entity or event type
- Column names differ by prefix, suffix, or convention across sources
- Preparing data for a UNION operation that requires matching column names
- Standardizing naming conventions as part of a data quality or governance initiative
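The union prerequisite in particular is easy to demonstrate. A small pandas sketch, with illustrative column names of my own choosing: once two frames share identical column names, appending rows is a one-liner.

```python
import pandas as pd

# Two already-normalized sources with matching schemas.
a = pd.DataFrame({"pickup_time": ["2021-01-01 00:15"], "fare": [10.0]})
b = pd.DataFrame({"pickup_time": ["2021-01-01 00:31"], "fare": [12.5]})

# Vertical combination (the UNION analogue in pandas):
# columns align by name, so matching names are essential.
combined = pd.concat([a, b], ignore_index=True)

print(len(combined))  # 2
```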
Theoretical Basis
Schema normalization can be expressed as a mapping function applied to each dataset:
```
FUNCTION normalize_schema(dataset, rename_mapping):
    FOR EACH (old_name, new_name) IN rename_mapping:
        dataset = rename_column(dataset, old_name, new_name)
    RETURN dataset

rename_map_source_a = {
    "source_a_pickup_time" -> "pickup_time",
    "source_a_dropoff_time" -> "dropoff_time"
}

rename_map_source_b = {
    "source_b_pickup_time" -> "pickup_time",
    "source_b_dropoff_time" -> "dropoff_time"
}

dataset_a = normalize_schema(raw_dataset_a, rename_map_source_a)
dataset_b = normalize_schema(raw_dataset_b, rename_map_source_b)

-- After normalization, both datasets share the same column names
-- and can be combined via UNION
```
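The pseudocode above maps directly onto a concrete implementation. A sketch in pandas, using the same invented source_a/source_b column names as the mapping example:

```python
import pandas as pd

def normalize_schema(dataset, rename_mapping):
    # pandas applies the whole mapping at once and returns a new frame,
    # mirroring the loop of rename_column calls in the pseudocode.
    return dataset.rename(columns=rename_mapping)

raw_dataset_a = pd.DataFrame({"source_a_pickup_time": [1], "source_a_dropoff_time": [2]})
raw_dataset_b = pd.DataFrame({"source_b_pickup_time": [3], "source_b_dropoff_time": [4]})

dataset_a = normalize_schema(raw_dataset_a, {"source_a_pickup_time": "pickup_time",
                                             "source_a_dropoff_time": "dropoff_time"})
dataset_b = normalize_schema(raw_dataset_b, {"source_b_pickup_time": "pickup_time",
                                             "source_b_dropoff_time": "dropoff_time"})

# Both frames now expose the same schema and could be concatenated.
print(list(dataset_a.columns) == list(dataset_b.columns))  # True
```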
The normalization function is idempotent in the sense that applying it to an already-normalized dataset with the same mapping produces no change. It is also composable -- multiple rename operations can be chained together, each transforming the output of the previous one.
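The idempotence claim can be checked directly. In this pandas sketch, re-applying the same mapping is a no-op because, by default, rename keys that do not match any existing column are silently ignored:

```python
import pandas as pd

mapping = {"lpep_pickup_datetime": "pickup_datetime"}
df = pd.DataFrame({"lpep_pickup_datetime": ["2021-01-01 00:15:56"]})

once = df.rename(columns=mapping)
# Second application: "lpep_pickup_datetime" no longer exists,
# so the mapping matches nothing and the frame is unchanged.
twice = once.rename(columns=mapping)

print(once.equals(twice))  # True
```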
The critical design decision is determining the canonical name for each concept. In the taxi data domain, the convention is to strip the service-type prefix (e.g., lpep_ or tpep_) and use the generic form (pickup_datetime, dropoff_datetime), since the service type will be tracked separately via a discriminator column after unification.
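The discriminator-column step can be sketched as follows, again in pandas with minimal single-column frames standing in for the full green and yellow trip data:

```python
import pandas as pd

# Already-normalized frames (prefixes stripped from the timestamp columns).
green = pd.DataFrame({"pickup_datetime": ["2021-01-01 00:15:56"]})
yellow = pd.DataFrame({"pickup_datetime": ["2021-01-01 00:31:10"]})

# Track the service type explicitly, since the prefix that used to
# encode it (lpep_/tpep_) has been removed from the column names.
green["service_type"] = "green"
yellow["service_type"] = "yellow"

# Unified dataset: one schema, with origin preserved per row.
trips = pd.concat([green, yellow], ignore_index=True)

print(len(trips))  # 2
```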