Implementation:DataExpert_io_Data_engineer_handbook_Do_player_scd_transformation
Overview
This page documents the do_player_scd_transformation function, which implements a Slowly Changing Dimension (SCD) Type 2 transformation on player scoring data. The function detects changes in a player's scoring class across seasons and produces date-ranged records for each contiguous period.
Type
API Doc
Source
players_scd_job.py:L1-54 (full file)
Signature
def do_player_scd_transformation(spark, dataframe) -> DataFrame
Import
from src.jobs.players_scd_job import do_player_scd_transformation
Inputs / Outputs
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | spark | SparkSession | An active SparkSession instance for executing SQL |
| Input | dataframe | DataFrame | A DataFrame containing columns: player_name, current_season, scoring_class |
| Output | result | DataFrame | A DataFrame containing columns: player_name, scoring_class, start_date, end_date |
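To make the input/output contract concrete, here is a hypothetical toy example (the player names and values below are illustrative, not taken from the source data): three consecutive seasons for one player collapse into one date-ranged row per contiguous scoring class.

```python
# Hypothetical toy rows illustrating the input schema
# (player_name, current_season, scoring_class).
input_rows = [
    {"player_name": "A. Player", "current_season": 2019, "scoring_class": "good"},
    {"player_name": "A. Player", "current_season": 2020, "scoring_class": "good"},
    {"player_name": "A. Player", "current_season": 2021, "scoring_class": "star"},
]

# Expected shape of the output: one row per contiguous scoring_class streak,
# with start_date/end_date marking the first and last season of the streak.
output_rows = [
    {"player_name": "A. Player", "scoring_class": "good", "start_date": 2019, "end_date": 2020},
    {"player_name": "A. Player", "scoring_class": "star", "start_date": 2021, "end_date": 2021},
]
```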
SQL Query Structure
The function registers the input DataFrame as a temporary view and executes a SQL query with three CTEs:
- `streak_started` - Uses `LAG(scoring_class) OVER (PARTITION BY player_name ORDER BY current_season)` to detect when a player's scoring class changes. Produces a `change_flag` column (1 if changed, 0 otherwise).
- `streak_identified` - Uses `SUM(change_flag) OVER (PARTITION BY player_name ORDER BY current_season)` to assign a unique `streak_id` to each contiguous group of unchanged scoring class values.
- `aggregated` - Uses `MIN(current_season)` and `MAX(current_season)` grouped by `player_name`, `scoring_class`, and `streak_id` to produce the final start and end dates for each SCD record.
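The three-CTE pattern above (LAG to flag changes, a running SUM to number streaks, then MIN/MAX per streak) can be sketched in plain Python without Spark. This is an illustrative reimplementation of the window-function logic, not the actual job code; the helper name `scd_streaks` is made up for this example.

```python
from itertools import groupby

def scd_streaks(rows):
    """Plain-Python sketch of the three-CTE streak logic.

    rows: iterable of (player_name, current_season, scoring_class) tuples.
    Returns a list of (player_name, scoring_class, start_date, end_date).
    """
    out = []
    # PARTITION BY player_name ORDER BY current_season.
    rows = sorted(rows)
    for player, player_rows in groupby(rows, key=lambda r: r[0]):
        prev_class = None  # stands in for LAG(scoring_class)
        streak_id = 0      # stands in for the running SUM(change_flag)
        streaks = {}       # streak_id -> (scoring_class, min_season, max_season)
        for _, season, cls in player_rows:
            if prev_class is not None and cls != prev_class:
                streak_id += 1  # change_flag = 1
            prev_class = cls
            if streak_id in streaks:
                c, lo, hi = streaks[streak_id]
                streaks[streak_id] = (c, min(lo, season), max(hi, season))
            else:
                streaks[streak_id] = (cls, season, season)
        # MIN/MAX per (player_name, scoring_class, streak_id) group.
        for sid in sorted(streaks):
            cls, lo, hi = streaks[sid]
            out.append((player, cls, lo, hi))
    return out
```

Running this on a player whose class changes once produces two date-ranged records, matching the SCD Type 2 output described above.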
```python
# Usage example
from pyspark.sql import SparkSession
from src.jobs.players_scd_job import do_player_scd_transformation

spark = SparkSession.builder.master("local").appName("players_scd").getOrCreate()
input_df = spark.read.table("players_raw")
output_df = do_player_scd_transformation(spark, input_df)
output_df.write.mode("overwrite").insertInto("players_scd")
```
Related Pages
- Principle:DataExpert_io_Data_engineer_handbook_SCD_Type_2_Transformation
- Environment:DataExpert_io_Data_engineer_handbook_Spark_Iceberg_Docker_Environment