Implementation:DataExpert io Data engineer handbook Do monthly user site hits transformation
Appearance
Overview
This page documents the do_monthly_user_site_hits_transformation function, which aggregates array-based hit data into per-day summaries for a given date partition. The function extracts individual elements from hit arrays and sums them across users grouped by month.
Type
API Doc
Source
monthly_user_site_hits_job.py:L7-19
Signature
def do_monthly_user_site_hits_transformation(spark, dataframe, ds) -> DataFrame
Import
from src.jobs.monthly_user_site_hits_job import do_monthly_user_site_hits_transformation
Inputs / Outputs
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | spark | SparkSession | An active SparkSession instance for executing SQL |
| Input | dataframe | DataFrame | A DataFrame containing columns: month_start, hit_array, date_partition
|
| Input | ds | str | The date string used to filter the date_partition column for incremental processing
|
| Output | result | DataFrame | A DataFrame containing columns: month_start, num_hits_first_day, num_hits_second_day, num_hits_third_day
|
Transformation Logic
The function performs the following steps:
- Registers the input DataFrame as a temporary view
- Filters rows where
date_partitionmatches the provideddsparameter - Extracts elements from
hit_arrayat positions 0, 1, and 2 using theget()function - Wraps each extracted value in
COALESCE(..., 0)to handle nulls - Applies
SUM()aggregation on each extracted element - Groups results by
month_start
# Usage example
spark = SparkSession.builder.master("local").appName("monthly_hits").getOrCreate()
input_df = spark.read.table("user_site_hits")
output_df = do_monthly_user_site_hits_transformation(spark, input_df, "2024-01-01")
output_df.write.mode("overwrite").insertInto("monthly_user_site_hits")
Related Pages
- Principle:DataExpert_io_Data_engineer_handbook_Aggregation_Transformation
- Environment:DataExpert_io_Data_engineer_handbook_Spark_Iceberg_Docker_Environment
Knowledge Sources
Metadata
Page Connections
Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment