Principle: DataTalksClub Data Engineering Zoomcamp Dataset Unification
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub Data Engineering Zoomcamp |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Dataset unification is the process of combining multiple homogeneous datasets into a single unified dataset through vertical concatenation, preserving origin information via a discriminator column.
Description
After schema normalization ensures that multiple datasets share identical column structures, the next step is to combine them into a single dataset. This operation is analogous to SQL's UNION ALL: all rows from each source are appended together into one result set, with no deduplication.
The unification process involves three key steps:
- Column selection -- Before combining, each dataset is projected down to a common set of columns. This ensures that only the shared columns participate in the union, discarding any source-specific columns that are not relevant to the combined analysis.
- Discriminator column injection -- A new column is added to each dataset before the union, containing a literal value that identifies the source. This preserves data lineage in the combined dataset, allowing downstream queries to filter or group by origin.
- Vertical concatenation -- The datasets are appended row-by-row. The resulting dataset contains all rows from all sources, with the discriminator column distinguishing their origin.
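The three steps can be sketched in plain Python over lists of dicts (no engine required); the column names, labels, and sample rows below are illustrative, not taken from the zoomcamp dataset:

```python
# Hypothetical common schema shared by all sources after normalization.
COMMON_COLUMNS = ["pickup_time", "amount"]

def unify(datasets, labels, columns=COMMON_COLUMNS):
    """Project each dataset to the shared columns, tag each row with its
    source label, and append everything together (UNION ALL semantics)."""
    unified = []
    for rows, label in zip(datasets, labels):
        for row in rows:
            selected = {col: row[col] for col in columns}  # column selection
            selected["source_type"] = label                # discriminator injection
            unified.append(selected)                       # vertical concatenation
    return unified

# Note the source-specific column "ehail_fee" is discarded by the projection.
green = [{"pickup_time": "2021-01-01", "amount": 12.5, "ehail_fee": 0.0}]
yellow = [{"pickup_time": "2021-01-02", "amount": 9.0}]
combined = unify([green, yellow], ["green", "yellow"])
```

No deduplication happens at any point: every input row appears exactly once per occurrence in the output.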
After unification, the combined dataset is typically registered as a named table (or view) so that it can be queried using SQL syntax. This bridges the DataFrame API and the SQL API, allowing analysts and engineers to use whichever query style is most appropriate.
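To show the "register, then query with SQL" bridge in a dependency-free way, here is a sketch using Python's built-in sqlite3. The table name matches the pseudocode below; the rows are made up. (Unlike a Spark temp view, sqlite3 does copy the data into the table; only the query pattern is the same.)

```python
import sqlite3

# Unified rows after column selection + discriminator injection.
rows = [
    ("green", 12.5),
    ("yellow", 9.0),
    ("yellow", 7.25),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unified_data (source_type TEXT, amount REAL)")
conn.executemany("INSERT INTO unified_data VALUES (?, ?)", rows)

# Downstream SQL can now filter or group by origin via the discriminator.
result = conn.execute(
    "SELECT source_type, COUNT(*), SUM(amount) "
    "FROM unified_data GROUP BY source_type ORDER BY source_type"
).fetchall()
# result -> [("green", 1, 12.5), ("yellow", 2, 16.25)]
```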
Usage
Use dataset unification when:
- Multiple datasets share the same schema (after normalization) and represent different partitions of the same domain
- You need a single table for aggregation queries that span multiple data sources
- Preserving the origin of each row is important for downstream filtering or grouping
- You want to enable SQL-based querying over a combined dataset
Theoretical Basis
The unification pattern can be described as follows:
DEFINE common_columns = [col_1, col_2, ..., col_n]

FUNCTION unify_datasets(datasets, labels):
    tagged_datasets = []
    FOR EACH (dataset, label) IN zip(datasets, labels):
        selected = select_columns(dataset, common_columns)
        tagged = add_column(selected, "source_type", literal(label))
        tagged_datasets.append(tagged)
    unified = tagged_datasets[0]
    FOR i = 1 TO length(tagged_datasets) - 1:
        unified = vertical_concat(unified, tagged_datasets[i])
    RETURN unified

combined = unify_datasets(
    datasets = [dataset_a, dataset_b],
    labels = ["source_a", "source_b"]
)
register_as_table(combined, "unified_data")
The UNION ALL semantics are important: unlike UNION (which removes duplicates), UNION ALL preserves all rows including duplicates. In batch analytics pipelines, deduplication is typically not desired at the union stage because:
- Duplicate detection across large datasets is expensive.
- Rows that appear identical may represent distinct real-world events.
- The discriminator column ensures that even rows with identical data values can be distinguished by their source.
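The difference between the two operators is easy to demonstrate with any SQL engine; here is a sketch using Python's built-in sqlite3 with two toy tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x").fetchall()

# UNION deduplicates:          [(1,), (2,), (3,)]
# UNION ALL keeps every row:   [(1,), (2,), (2,), (3,)]
```

The value 2 appears in both tables; UNION collapses it to one row, while UNION ALL keeps both occurrences, which is the behavior unification relies on.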
The table registration step creates a logical view over the in-memory data. No data is copied or materialized -- registration simply assigns a name that the SQL engine resolves during query parsing.