Principle: DataTalksClub Data Engineering Zoomcamp Dataset Unification
| Page Metadata | |
|---|---|
| Knowledge Sources | DataTalksClub Data Engineering Zoomcamp |
| Domains | Data_Engineering, Batch_Processing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Dataset unification is the process of combining multiple homogeneous datasets into a single unified dataset through vertical concatenation, preserving origin information via a discriminator column.
Description
After schema normalization ensures that multiple datasets share identical column structures, the next step is to combine them into a single dataset. This operation is analogous to SQL's UNION ALL: all rows from each source are appended together into one result set, with no deduplication.
The unification process involves three key steps:
- Column selection -- Before combining, each dataset is projected down to a common set of columns. This ensures that only the shared columns participate in the union, discarding any source-specific columns that are not relevant to the combined analysis.
- Discriminator column injection -- A new column is added to each dataset before the union, containing a literal value that identifies the source. This preserves data lineage in the combined dataset, allowing downstream queries to filter or group by origin.
- Vertical concatenation -- The datasets are appended row-by-row. The resulting dataset contains all rows from all sources, with the discriminator column distinguishing their origin.
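The three steps can be sketched in plain Python over lists of dicts (no engine required); the column names, labels, and sample rows below are illustrative, not taken from the zoomcamp dataset:

```python
# Hypothetical common schema shared by all sources after normalization.
COMMON_COLUMNS = ["pickup_time", "amount"]

def unify(datasets, labels, columns=COMMON_COLUMNS):
    """Project each dataset to the shared columns, tag each row with its
    source label, and append everything together (UNION ALL semantics)."""
    unified = []
    for rows, label in zip(datasets, labels):
        for row in rows:
            selected = {col: row[col] for col in columns}  # column selection
            selected["source_type"] = label                # discriminator injection
            unified.append(selected)                       # vertical concatenation
    return unified

# Note the source-specific column "ehail_fee" is discarded by the projection.
green = [{"pickup_time": "2021-01-01", "amount": 12.5, "ehail_fee": 0.0}]
yellow = [{"pickup_time": "2021-01-02", "amount": 9.0}]
combined = unify([green, yellow], ["green", "yellow"])
```

No deduplication happens at any point: every input row appears exactly once per occurrence in the output.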
After unification, the combined dataset is typically registered as a named table (or view) so that it can be queried using SQL syntax. This bridges the DataFrame API and the SQL API, allowing analysts and engineers to use whichever query style is most appropriate.
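To show the "register, then query with SQL" bridge in a dependency-free way, here is a sketch using Python's built-in sqlite3. The table name matches the pseudocode below; the rows are made up. (Unlike a Spark temp view, sqlite3 does copy the data into the table; only the query pattern is the same.)

```python
import sqlite3

# Unified rows after column selection + discriminator injection.
rows = [
    ("green", 12.5),
    ("yellow", 9.0),
    ("yellow", 7.25),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unified_data (source_type TEXT, amount REAL)")
conn.executemany("INSERT INTO unified_data VALUES (?, ?)", rows)

# Downstream SQL can now filter or group by origin via the discriminator.
result = conn.execute(
    "SELECT source_type, COUNT(*), SUM(amount) "
    "FROM unified_data GROUP BY source_type ORDER BY source_type"
).fetchall()
# result -> [("green", 1, 12.5), ("yellow", 2, 16.25)]
```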
Usage
Use dataset unification when:
- Multiple datasets share the same schema (after normalization) and represent different partitions of the same domain
- You need a single table for aggregation queries that span multiple data sources
- Preserving the origin of each row is important for downstream filtering or grouping
- You want to enable SQL-based querying over a combined dataset
Theoretical Basis
The unification pattern can be described as follows:
DEFINE common_columns = [col_1, col_2, ..., col_n]

FUNCTION unify_datasets(datasets, labels):
    tagged_datasets = []
    FOR EACH (dataset, label) IN zip(datasets, labels):
        selected = select_columns(dataset, common_columns)
        tagged = add_column(selected, "source_type", literal(label))
        tagged_datasets.append(tagged)
    unified = tagged_datasets[0]
    FOR i = 1 TO length(tagged_datasets) - 1:
        unified = vertical_concat(unified, tagged_datasets[i])
    RETURN unified

combined = unify_datasets(
    datasets = [dataset_a, dataset_b],
    labels = ["source_a", "source_b"]
)
register_as_table(combined, "unified_data")
The UNION ALL semantics are important: unlike UNION (which removes duplicates), UNION ALL preserves all rows including duplicates. In batch analytics pipelines, deduplication is typically not desired at the union stage because:
- Duplicate detection across large datasets is expensive.
- Rows that appear identical may represent distinct real-world events.
- The discriminator column ensures that even rows with identical data values can be distinguished by their source.
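The difference between the two operators is easy to demonstrate with any SQL engine; here is a sketch using Python's built-in sqlite3 with two toy tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x").fetchall()

# UNION deduplicates:          [(1,), (2,), (3,)]
# UNION ALL keeps every row:   [(1,), (2,), (2,), (3,)]
```

The value 2 appears in both tables; UNION collapses it to one row, while UNION ALL keeps both occurrences, which is the behavior unification relies on.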
The table registration step creates a logical view over the in-memory data. No data is copied or materialized -- registration simply assigns a name that the SQL engine resolves during query parsing.