Principle:Eventual Inc Daft Data Joining
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Analysis |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Data joining is the technique for combining two DataFrames by matching rows on shared key columns, supporting multiple join types and strategies.
Description
Data joining merges rows from two DataFrames based on matching values in specified key columns. Daft supports inner, left, right, outer, anti, semi, and cross join types, covering the full spectrum of relational join semantics. Users can configure the join strategy (hash, sort-merge, or broadcast) to optimize performance for different data distributions, or leave it to the query optimizer to choose automatically. When column name collisions occur between the two DataFrames, conflicting right-side columns are prefixed (default "right.") or suffixed to avoid ambiguity.
Usage
Use data joining when you need to combine data from two sources based on common keys. Common scenarios include enriching fact tables with dimension data, merging user profiles with activity logs, and combining results from different processing stages. Choose the join type based on your requirements: inner for matched rows only, left/right for preserving one side, outer for preserving both sides, anti for exclusion, and semi for existence checks.
Theoretical Basis
Data joining implements the relational join operation, which combines tuples from two relations based on a join predicate. The main join types are:
Inner Join: R INNER JOIN S ON R.key = S.key -- only matching rows
Left Join: R LEFT JOIN S ON R.key = S.key -- all rows from R, matched from S
Right Join: R RIGHT JOIN S ON R.key = S.key -- all rows from S, matched from R
Outer Join: R FULL OUTER JOIN S ON R.key = S.key -- all rows from both
Anti Join: R ANTI JOIN S ON R.key = S.key -- rows in R with no match in S
Semi Join: R SEMI JOIN S ON R.key = S.key -- rows in R with a match in S
Cross Join: R CROSS JOIN S -- Cartesian product
Join strategies affect execution performance:
- Hash join: Builds a hash table on the smaller relation, probes with the larger. Good for equi-joins.
- Sort-merge join: Sorts both relations by join key, then merges. Efficient when data is pre-sorted. Only supports inner joins.
- Broadcast join: Broadcasts the smaller relation to all partitions. Efficient when one side is small.