Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Eventual Inc Daft Data Joining

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Data_Analysis
Last Updated 2026-02-08 00:00 GMT

Overview

Data joining is the technique for combining two DataFrames by matching rows on shared key columns, supporting multiple join types and strategies.

Description

Data joining merges rows from two DataFrames based on matching values in specified key columns. Daft supports inner, left, right, outer, anti, semi, and cross join types, covering the full spectrum of relational join semantics. Users can configure the join strategy (hash, sort-merge, or broadcast) to optimize performance for different data distributions, or leave it to the query optimizer to choose automatically. When column name collisions occur between the two DataFrames, conflicting right-side columns are prefixed (default "right.") or suffixed to avoid ambiguity.

Usage

Use data joining when you need to combine data from two sources based on common keys. Common scenarios include enriching fact tables with dimension data, merging user profiles with activity logs, and combining results from different processing stages. Choose the join type based on your requirements: inner for matched rows only, left/right for preserving one side, outer for preserving both sides, anti for exclusion, and semi for existence checks.

Theoretical Basis

Data joining implements the relational join operation, which combines tuples from two relations based on a join predicate. The main join types are:

Inner Join:    R INNER JOIN S ON R.key = S.key    -- only matching rows
Left Join:     R LEFT JOIN S ON R.key = S.key     -- all rows from R, matched from S
Right Join:    R RIGHT JOIN S ON R.key = S.key    -- all rows from S, matched from R
Outer Join:    R FULL OUTER JOIN S ON R.key = S.key -- all rows from both
Anti Join:     R ANTI JOIN S ON R.key = S.key     -- rows in R with no match in S
Semi Join:     R SEMI JOIN S ON R.key = S.key     -- rows in R with a match in S
Cross Join:    R CROSS JOIN S                      -- Cartesian product

Join strategies affect execution performance:

  • Hash join: Builds a hash table on the smaller relation, probes with the larger. Good for equi-joins.
  • Sort-merge join: Sorts both relations by join key, then merges. Efficient when data is pre-sorted. Only supports inner joins.
  • Broadcast join: Broadcasts the smaller relation to all partitions. Efficient when one side is small.

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment