Implementation:DataExpert io Data engineer handbook Chispa Assert df equality

From Leeroopedia


Overview

Type: Wrapper Doc (external chispa library)

This implementation documents the assert_df_equality function from the chispa library, which provides structured DataFrame comparison for PySpark unit tests. It compares two DataFrames by schema and data, with support for ignoring nullable differences.

Source

  • test_monthly_user_site_hits.py:L57
  • test_player_scd.py:L24
  • test_team_vertex_job.py:L32

Signature

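assert_df_equality(
    df1: DataFrame,
    df2: DataFrame,
    ignore_nullable: bool = False,
    # further optional keyword arguments (e.g. ignore_row_order,
    # ignore_column_order) exist in chispa but are not used in this handbook
) -> None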

Parameters

Parameter | Type | Default | Description
df1 | DataFrame | (required) | The actual DataFrame produced by the transformation under test
df2 | DataFrame | (required) | The expected DataFrame constructed in the test
ignore_nullable | bool | False | When True, differences in the nullable flag of schema fields are ignored during schema comparison

Return Value

  • Returns None if the DataFrames are equal (test passes)
  • Raises DataFramesNotEqualError if the rows differ, or SchemasNotEqualError (from chispa.schema_comparer) if the schemas differ (test fails)

Import

from chispa.dataframe_comparer import assert_df_equality

I/O

  • Inputs:
    • df1: the actual DataFrame (result of the transformation under test)
    • df2: the expected DataFrame (hand-constructed in the test)
    • ignore_nullable: boolean flag controlling whether nullable flags are compared
  • Outputs:
    • None: returned when the DataFrames are equal (assertion passes)
    • Exception: DataFramesNotEqualError when the rows differ, SchemasNotEqualError when the schemas differ (assertion fails)
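
The return-None-or-raise contract described above can be illustrated without a Spark session. The following is a minimal plain-Python sketch, not chispa's code: `SchemaMismatch`, `DataMismatch` and `assert_tables_equal` are hypothetical stand-ins, and the schema-before-rows ordering is an assumption about how such a comparison is typically staged.

```python
from typing import List, Tuple

# Hypothetical stand-ins for illustration; chispa defines its own exceptions.
class SchemaMismatch(Exception):
    pass

class DataMismatch(Exception):
    pass

Field = Tuple[str, str]      # (column name, type name)
Row = Tuple[object, ...]

def assert_tables_equal(schema1: List[Field], rows1: List[Row],
                        schema2: List[Field], rows2: List[Row]) -> None:
    """Return None when both tables match; raise otherwise."""
    # Stage 1: compare schemas, so a type or name drift is reported as such.
    if schema1 != schema2:
        raise SchemaMismatch(f"{schema1} != {schema2}")
    # Stage 2: compare rows in order.
    if rows1 != rows2:
        raise DataMismatch(f"{rows1} != {rows2}")
    # Falling off the end returns None, which a test runner treats as a pass.

schema = [("user_id", "bigint"), ("hits", "int")]

# Equal tables: the call simply returns None.
result = assert_tables_equal(schema, [(1, 10)], schema, [(1, 10)])

# Differing rows: the call raises, which fails the enclosing test.
try:
    assert_tables_equal(schema, [(1, 10)], schema, [(1, 99)])
    outcome = "passed"
except DataMismatch:
    outcome = "data mismatch"
```

Because the function signals failure only by raising, a test needs no `assert` statement around the call; reaching the next line means the comparison succeeded.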

Usage Examples

Basic comparison

from chispa.dataframe_comparer import assert_df_equality

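def test_transformation(spark):
    # `spark` is a SparkSession fixture; my_transformation, input_data and
    # expected_data are placeholders for the code and rows under test.
    input_df = spark.createDataFrame(input_data)
    actual_df = my_transformation(input_df)
    expected_df = spark.createDataFrame(expected_data)
    assert_df_equality(actual_df, expected_df)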

Ignoring nullable differences

from chispa.dataframe_comparer import assert_df_equality

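def test_transformation_ignore_nullable(spark):
    # `spark` is a SparkSession fixture; my_transformation, input_data and
    # expected_data are placeholders for the code and rows under test.
    input_df = spark.createDataFrame(input_data)
    actual_df = my_transformation(input_df)
    expected_df = spark.createDataFrame(expected_data)
    # Nullable flags are not compared; names, types and rows still are.
    assert_df_equality(actual_df, expected_df, ignore_nullable=True)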

Usage in Test Files

test_monthly_user_site_hits.py (Line 57)

assert_df_equality(actual_df, expected_df, ignore_nullable=True)

test_player_scd.py (Line 24)

assert_df_equality(actual_df, expected_df, ignore_nullable=True)

test_team_vertex_job.py (Line 32)

assert_df_equality(actual_df, expected_df, ignore_nullable=True)

All three test files use ignore_nullable=True to avoid false failures caused by nullable flag differences between hand-constructed and transformation-produced DataFrames.
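
To see why the flag matters, recall that Spark generally marks columns inferred from Python data as nullable, while some computed columns (literals, for example) come out non-nullable. The sketch below models that situation in plain Python so it runs without Spark; `Field`, `schemas_match` and the column names are hypothetical and are not part of chispa.

```python
from typing import List, NamedTuple

class Field(NamedTuple):
    name: str
    dtype: str
    nullable: bool

def schemas_match(s1: List[Field], s2: List[Field],
                  ignore_nullable: bool = False) -> bool:
    """Compare two schemas field by field, optionally skipping nullable."""
    if len(s1) != len(s2):
        return False
    for a, b in zip(s1, s2):
        # Names and types must always agree.
        if (a.name, a.dtype) != (b.name, b.dtype):
            return False
        # The nullable flag is compared only in strict mode.
        if not ignore_nullable and a.nullable != b.nullable:
            return False
    return True

# Hypothetical schemas: the transformation output marks a computed column as
# non-nullable, while the hand-built expected schema leaves it nullable.
actual = [Field("team_id", "bigint", True), Field("type", "string", False)]
expected = [Field("team_id", "bigint", True), Field("type", "string", True)]

strict = schemas_match(actual, expected)                         # False
relaxed = schemas_match(actual, expected, ignore_nullable=True)  # True
```

The alternative to relaxing the check is to build the expected DataFrame with an explicit schema whose nullable flags mirror the transformation output, which is stricter but makes the tests more verbose.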

External Reference

The chispa library is available on PyPI:

  • Package: chispa on PyPI
  • Purpose: PySpark test helper library providing DataFrame and column comparison utilities
  • Installation: pip install chispa
