Implementation:DataExpert io Data engineer handbook Namedtuple CreateDataFrame Pattern
Overview
Type: Pattern Doc
This page documents the namedtuple + createDataFrame pattern used across multiple test files to construct typed, schema-enforced test DataFrames in PySpark unit tests.
Source
test_monthly_user_site_hits.py:L4-7
test_player_scd.py:L3-5
test_team_vertex_job.py:L4-7
Interface
The pattern follows a consistent three-step process:
Step 1: Define a namedtuple
PlayerSeason = namedtuple("PlayerSeason", "player_name current_season scoring_class")
This defines a typed record with explicit field names that will become DataFrame column names.
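The link between field names and column names can be seen in plain Python, without Spark: the names declared in the namedtuple are exactly what createDataFrame will use as columns.

```python
from collections import namedtuple

# Same definition as above: field names become DataFrame column names.
PlayerSeason = namedtuple("PlayerSeason", "player_name current_season scoring_class")

# The declared field names are available on the class itself.
print(PlayerSeason._fields)  # ('player_name', 'current_season', 'scoring_class')

# Instances carry the same names, so each row documents itself.
row = PlayerSeason("Michael Jordan", 2001, "Good")
print(row._asdict())
# {'player_name': 'Michael Jordan', 'current_season': 2001, 'scoring_class': 'Good'}
```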
Step 2: Create data as a list of namedtuple instances
input_data = [
PlayerSeason("Michael Jordan", 2001, 'Good'),
PlayerSeason("LeBron James", 2003, 'Great'),
]
Each element in the list represents one row. Field names enforce consistency across all rows.
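That enforcement is visible without Spark: constructing a row with the wrong number of values fails immediately at construction time, unlike an anonymous tuple, which would silently produce a mis-shaped row.

```python
from collections import namedtuple

PlayerSeason = namedtuple("PlayerSeason", "player_name current_season scoring_class")

# Positional or keyword construction both work for a correct row.
ok = PlayerSeason(player_name="LeBron James", current_season=2003, scoring_class="Great")

# A row with a missing field raises TypeError immediately,
# rather than reaching createDataFrame with the wrong shape.
try:
    PlayerSeason("LeBron James", 2003)
except TypeError as exc:
    print(f"rejected: {exc}")
```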
Step 3: Create a DataFrame from the data
source_df = spark.createDataFrame(input_data)
Spark infers the schema from the namedtuple fields and value types, producing a fully typed DataFrame.
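As a rough, pure-Python sketch of the inference idea (this mapping is an illustration, not PySpark's actual implementation, which inspects each value's type internally): Python ints are inferred as LongType, strs as StringType, and floats as DoubleType.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for Spark's schema inference.
PYTHON_TO_SPARK = {
    int: "LongType",      # Python int  -> LongType
    str: "StringType",    # Python str  -> StringType
    float: "DoubleType",  # Python float -> DoubleType
}

def sketch_schema(row):
    """Pair each namedtuple field with a rough Spark type name."""
    return {name: PYTHON_TO_SPARK[type(value)]
            for name, value in row._asdict().items()}

PlayerSeason = namedtuple("PlayerSeason", "player_name current_season scoring_class")
print(sketch_schema(PlayerSeason("Michael Jordan", 2001, "Good")))
# {'player_name': 'StringType', 'current_season': 'LongType', 'scoring_class': 'StringType'}
```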
Examples from Test Files
test_monthly_user_site_hits.py
from collections import namedtuple
MonthlyUserSiteHit = namedtuple("MonthlyUserSiteHit", "user_id site_id month hit_count")
input_data = [
MonthlyUserSiteHit(1, 100, "2023-01", 5),
MonthlyUserSiteHit(1, 100, "2023-02", 10),
]
source_df = spark.createDataFrame(input_data)
test_player_scd.py
from collections import namedtuple
PlayerSeason = namedtuple("PlayerSeason", "player_name current_season scoring_class")
input_data = [
PlayerSeason("Michael Jordan", 2001, 'Good'),
PlayerSeason("Michael Jordan", 2002, 'Great'),
]
source_df = spark.createDataFrame(input_data)
test_team_vertex_job.py
from collections import namedtuple
TeamVertex = namedtuple("TeamVertex", "team_id team_name conference")
input_data = [
TeamVertex(1, "Lakers", "Western"),
TeamVertex(2, "Celtics", "Eastern"),
]
source_df = spark.createDataFrame(input_data)
Import
from collections import namedtuple
The namedtuple factory is part of Python's standard library collections module. No external dependencies are required.
I/O
- Inputs:
- A namedtuple class definition (field names as strings)
- A list of namedtuple instances (the test data rows)
- A SparkSession instance (typically injected via the spark fixture)
- Outputs:
- A PySpark DataFrame with schema inferred from the namedtuple fields and value types
Pattern Notes
- The same pattern is used to construct both input DataFrames and expected output DataFrames within each test function
- Different namedtuple definitions are used for input versus output when the transformation changes the schema
- The pattern keeps tests readable because each row is a self-documenting named record rather than an anonymous tuple
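A minimal sketch of the second note above (the field and class names here are illustrative, not taken from the repository): when a job adds a column, the expected-output rows get their own namedtuple with the extra field.

```python
from collections import namedtuple

# Input schema: one row per season (illustrative fields).
PlayerSeason = namedtuple("PlayerSeason", "player_name current_season scoring_class")

# Output schema: suppose the job adds a streak-length column, so the
# expected rows use a *different* namedtuple with that extra field.
PlayerScd = namedtuple("PlayerScd", "player_name scoring_class streak_length")

input_rows = [
    PlayerSeason("Michael Jordan", 2001, "Good"),
    PlayerSeason("Michael Jordan", 2002, "Great"),
]
expected_rows = [
    PlayerScd("Michael Jordan", "Good", 1),
    PlayerScd("Michael Jordan", "Great", 1),
]

# Each list would be passed to spark.createDataFrame(...) in the real test.
print(PlayerSeason._fields, PlayerScd._fields)
```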
Related Pages
- Principle:DataExpert_io_Data_engineer_handbook_Test_Data_Construction
- Environment:DataExpert_io_Data_engineer_handbook_Python_Development_Environment
- Heuristic:DataExpert_io_Data_engineer_handbook_SparkSession_Singleton_Pattern