Implementation:DataExpert io Data engineer handbook Chispa Assert df equality
Overview
Type: Wrapper Doc (external chispa library)
This implementation documents the assert_df_equality function from the chispa library, which provides structured DataFrame comparison for PySpark unit tests. It compares two DataFrames by schema and data, with support for ignoring nullable differences.
Source
- test_monthly_user_site_hits.py:L57
- test_player_scd.py:L24
- test_team_vertex_job.py:L32
Signature
assert_df_equality(
    df1: DataFrame,
    df2: DataFrame,
    ignore_nullable: bool = False
) -> None
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| df1 | DataFrame | (required) | The actual DataFrame produced by the transformation under test |
| df2 | DataFrame | (required) | The expected DataFrame constructed in the test |
| ignore_nullable | bool | False | When True, strips nullable flags from both schemas before comparison |
Return Value
- Returns None if the DataFrames are equal (test passes)
- Raises DataFramesNotEqualError if the DataFrames differ (test fails)
Import
from chispa.dataframe_comparer import assert_df_equality
I/O
- Inputs:
  - df1: the actual DataFrame (result of the transformation under test)
  - df2: the expected DataFrame (hand-constructed in the test)
  - ignore_nullable: boolean flag controlling nullable comparison
- Outputs:
  - None: the function returns nothing when the DataFrames are equal (assertion passes)
  - Raises DataFramesNotEqualError when the DataFrames differ in schema or data (assertion fails)
Usage Examples
Basic comparison
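from chispa.dataframe_comparer import assert_df_equality

# my_transformation, input_df and expected_data are placeholders for the
# transformation under test and its fixtures.
def test_transformation(spark):
    actual_df = my_transformation(input_df)
    expected_df = spark.createDataFrame(expected_data)
    # Raises DataFramesNotEqualError if schema or data differ.
    assert_df_equality(actual_df, expected_df)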
Ignoring nullable differences
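from chispa.dataframe_comparer import assert_df_equality

# my_transformation, input_df and expected_data are placeholders for the
# transformation under test and its fixtures.
def test_transformation_ignore_nullable(spark):
    actual_df = my_transformation(input_df)
    expected_df = spark.createDataFrame(expected_data)
    # Nullable flags are disregarded; column names, types and data must still match.
    assert_df_equality(actual_df, expected_df, ignore_nullable=True)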
Usage in Test Files
test_monthly_user_site_hits.py (Line 57)
assert_df_equality(actual_df, expected_df, ignore_nullable=True)
test_player_scd.py (Line 24)
assert_df_equality(actual_df, expected_df, ignore_nullable=True)
test_team_vertex_job.py (Line 32)
assert_df_equality(actual_df, expected_df, ignore_nullable=True)
All three test files use ignore_nullable=True to avoid false failures caused by nullable flag differences between hand-constructed and transformation-produced DataFrames.
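To make the nullable mismatch concrete, the following is a minimal pure-Python sketch of the idea, not chispa's actual implementation. The `Field` tuple, the `schemas_equal` helper, and the column names `user_id` and `hits` are all hypothetical and exist only for illustration. It models a schema as a list of (name, type, nullable) entries and shows how dropping the nullable flag turns a failing comparison into a passing one.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a schema field (not a PySpark or chispa class)."""
    name: str
    dtype: str
    nullable: bool


def schemas_equal(s1, s2, ignore_nullable=False):
    """Compare two schemas, optionally disregarding the nullable flag."""
    if ignore_nullable:
        # Keep only name and type so the nullable flag cannot cause a mismatch.
        s1 = [(f.name, f.dtype) for f in s1]
        s2 = [(f.name, f.dtype) for f in s2]
    return list(s1) == list(s2)


# Assumed scenario: the transformation yields a non-nullable "hits" column,
# while the hand-constructed expected schema marks it as nullable.
actual_schema = [Field("user_id", "bigint", True), Field("hits", "int", False)]
expected_schema = [Field("user_id", "bigint", True), Field("hits", "int", True)]

strict = schemas_equal(actual_schema, expected_schema)
relaxed = schemas_equal(actual_schema, expected_schema, ignore_nullable=True)
print(strict, relaxed)  # False True
```

In this model, a genuine difference in column name or type still fails the relaxed comparison, which is the behaviour the tests rely on when passing ignore_nullable=True.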
External Reference
The chispa library is available on PyPI:
- Package: chispa on PyPI
- Purpose: PySpark test helper library providing DataFrame and column comparison utilities
- Installation:
pip install chispa
Related Pages
- Principle:DataExpert_io_Data_engineer_handbook_DataFrame_Equality_Assertion
- Environment:DataExpert_io_Data_engineer_handbook_Python_Development_Environment
- Heuristic:DataExpert_io_Data_engineer_handbook_SparkSession_Singleton_Pattern