Implementation:DataExpert io Data engineer handbook Do team vertex transformation
Overview
This page documents the do_team_vertex_transformation function, which converts relational NBA team data into a graph vertex format. The function deduplicates teams and produces vertices with an identifier, type label, and property map.
Type
API Doc
Source
team_vertex_job.py:L1-36 (full file)
Signature
def do_team_vertex_transformation(spark, dataframe) -> DataFrame
Import
from src.jobs.team_vertex_job import do_team_vertex_transformation
Inputs / Outputs
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | spark | SparkSession | An active SparkSession instance for executing SQL |
| Input | dataframe | DataFrame | A DataFrame containing columns: team_id, abbreviation, nickname, city, arena, yearfounded |
| Output | result | DataFrame | A DataFrame containing columns: identifier, type, properties (map) |
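To make the two schemas concrete, here is a minimal pure-Python sketch of how one input row could map to one output vertex. It does not use Spark; `team_row_to_vertex`, `PROPERTY_COLUMNS`, and the sample values are hypothetical and exist only for illustration.

```python
from typing import Any, Dict

# Input columns that end up inside the properties map.
PROPERTY_COLUMNS = ("abbreviation", "nickname", "city", "arena", "yearfounded")


def team_row_to_vertex(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: build the vertex shape for a single team row."""
    properties = {}
    for column in PROPERTY_COLUMNS:
        value = row[column]
        # All values in a map share one type, so non-string values
        # (such as the founding year) are rendered as strings here.
        properties[column] = None if value is None else str(value)
    return {"identifier": row["team_id"], "type": "team", "properties": properties}


# Placeholder data, not real team records.
sample_row = {
    "team_id": 1,
    "abbreviation": "ABC",
    "nickname": "Examples",
    "city": "Sample City",
    "arena": "Sample Arena",
    "yearfounded": 1950,
}

vertex = team_row_to_vertex(sample_row)
print(vertex)
```

The output keeps `team_id` as the identifier, uses the constant label `team`, and moves every descriptive column into the map.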
SQL Query Structure
The function registers the input DataFrame as a temporary view and executes a SQL query with one CTE:
- teams_deduped - Uses ROW_NUMBER() OVER (PARTITION BY team_id ORDER BY ...) to deduplicate team records, keeping only the row numbered 1 (row_num = 1) for each team_id.
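The effect of the `teams_deduped` CTE can be sketched in plain Python, as an illustration rather than the job's actual code; `dedupe_by_team_id` is a hypothetical name. One caveat: the query orders each partition by `team_id`, the same column it partitions on, so every row in a partition ties and Spark does not guarantee which duplicate receives `row_num = 1`. The sketch simply keeps the first row it encounters.

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_team_id(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one row per team_id (the first one seen), like WHERE row_num = 1."""
    first_seen: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        # setdefault stores the row only if this team_id has not been seen yet.
        first_seen.setdefault(row["team_id"], row)
    return list(first_seen.values())


# Placeholder rows with one duplicated team_id.
rows = [
    {"team_id": 1, "nickname": "Alpha"},
    {"team_id": 2, "nickname": "Beta"},
    {"team_id": 1, "nickname": "Alpha"},
]

deduped = dedupe_by_team_id(rows)
print([row["team_id"] for row in deduped])  # [1, 2]
```

If the duplicates can differ in content, ordering the window by a column that distinguishes them would make the surviving row deterministic.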
The final SELECT produces the vertex format:
WITH teams_deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY team_id ORDER BY team_id) AS row_num
    FROM teams_raw
)
SELECT
    team_id AS identifier,
    'team' AS type,
    MAP(
        'abbreviation', abbreviation,
        'nickname', nickname,
        'city', city,
        'arena', arena,
        'yearfounded', CAST(yearfounded AS STRING)
    ) AS properties
FROM teams_deduped
WHERE row_num = 1
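The description above says the function registers the input DataFrame as a temporary view and then runs the query. A minimal sketch of that pattern follows, assuming the view is named `teams_raw` (taken from the `FROM` clause) and using a hypothetical helper, `run_query_over_view`, that receives the SQL text as an argument. The real function body in team_vertex_job.py may be organized differently. `createOrReplaceTempView` and `spark.sql` are standard PySpark calls.

```python
from typing import Any

VIEW_NAME = "teams_raw"  # assumption: matches the FROM clause of the query


def run_query_over_view(
    spark: Any, dataframe: Any, query: str, view_name: str = VIEW_NAME
) -> Any:
    """Hypothetical helper: expose `dataframe` to SQL as `view_name`, then run `query`.

    The parameters are typed as Any so the sketch imports without PySpark;
    in practice they would be a SparkSession and a DataFrame.
    """
    # Registers (or replaces) a session-scoped view backed by the DataFrame.
    dataframe.createOrReplaceTempView(view_name)
    # Returns a new DataFrame; nothing is computed until an action runs.
    return spark.sql(query)
```

Because the view is created with "replace" semantics, calling the helper again with a different DataFrame overwrites the earlier view of the same name within that session.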
Usage Example
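from pyspark.sql import SparkSession
from src.jobs.team_vertex_job import do_team_vertex_transformation

spark = SparkSession.builder.master("local").appName("team_vertex").getOrCreate()
input_df = spark.read.table("teams_raw")
output_df = do_team_vertex_transformation(spark, input_df)
# insertInto requires an existing table and matches columns by position, not by name.
output_df.write.mode("overwrite").insertInto("team_vertices")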
Related Pages
- Principle:DataExpert_io_Data_engineer_handbook_Graph_Vertex_Generation
- Environment:DataExpert_io_Data_engineer_handbook_Spark_Iceberg_Docker_Environment