Implementation:DataExpert io Data engineer handbook Chispa Assert df equality

Overview

Type: Wrapper Doc (external chispa library)

This implementation documents the assert_df_equality function from the chispa library, which provides structured DataFrame comparison for PySpark unit tests. It compares two DataFrames by schema and data, with support for ignoring nullable differences.

Source

test_monthly_user_site_hits.py:L57
test_player_scd.py:L24
test_team_vertex_job.py:L32

Signature

assert_df_equality(
    df1: DataFrame,
    df2: DataFrame,
    ignore_nullable: bool = False
) -> None

Parameters

Parameter	Type	Default	Description
`df1`	`DataFrame`	(required)	The actual DataFrame produced by the transformation under test
`df2`	`DataFrame`	(required)	The expected DataFrame constructed in the test
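`ignore_nullable`	`bool`	`False`	When `True`, differences in column nullability are ignored during schema comparison

Only the parameters used in this repository's tests are listed here. chispa's function accepts further keyword arguments (for example `ignore_row_order` and `ignore_column_order`) that these tests do not use.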

Return Value

Returns None if the DataFrames are equal (test passes)
Raises SchemasNotEqualError if the schemas differ, or DataFramesNotEqualError if the schemas match but the rows differ (test fails)

Import

from chispa.dataframe_comparer import assert_df_equality

I/O

Inputs:
- df1 — the actual DataFrame (result of the transformation under test)
- df2 — the expected DataFrame (hand-constructed in the test)
- ignore_nullable — boolean flag controlling nullable comparison
Outputs:
- None — the function returns nothing when DataFrames are equal (assertion passes)
- Raises SchemasNotEqualError when the schemas differ, or DataFramesNotEqualError when the schemas match but the data differs (assertion fails)

Usage Examples

Basic comparison

from chispa.dataframe_comparer import assert_df_equality

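def test_transformation(spark):
    # `spark` is a SparkSession fixture; `my_transformation`, `input_df`, and
    # `expected_data` are placeholders for the job and data under test.
    actual_df = my_transformation(input_df)
    expected_df = spark.createDataFrame(expected_data)
    assert_df_equality(actual_df, expected_df)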

Ignoring nullable differences

from chispa.dataframe_comparer import assert_df_equality

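def test_transformation_ignore_nullable(spark):
    # Same placeholders as the basic comparison: `my_transformation`, `input_df`, `expected_data`.
    actual_df = my_transformation(input_df)
    expected_df = spark.createDataFrame(expected_data)
    # Column names, types, and rows must still match; only nullability is ignored.
    assert_df_equality(actual_df, expected_df, ignore_nullable=True)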

Usage in Test Files

test_monthly_user_site_hits.py (Line 57)

assert_df_equality(actual_df, expected_df, ignore_nullable=True)

test_player_scd.py (Line 24)

assert_df_equality(actual_df, expected_df, ignore_nullable=True)

test_team_vertex_job.py (Line 32)

assert_df_equality(actual_df, expected_df, ignore_nullable=True)

All three test files use ignore_nullable=True to avoid false failures caused by nullable flag differences between hand-constructed and transformation-produced DataFrames.

External Reference

The chispa library is available on PyPI:

Package: chispa on PyPI
Purpose: PySpark test helper library providing DataFrame and column comparison utilities
Installation: pip install chispa
