Principle: DataFrame Equality Assertion (DataExpert.io Data Engineer Handbook)
Overview
The DataFrame Equality Assertion principle addresses the theory of comparing actual versus expected DataFrames in PySpark unit tests. A correct assertion must verify that both the schema (column names and types) and the data (row values) match between the two DataFrames. This principle also covers the handling of nullable differences, which frequently arise when comparing DataFrames constructed from different sources.
Theory of DataFrame Comparison for Testing
In PySpark testing, the fundamental assertion pattern is:
- Apply a transformation to an input DataFrame to produce an actual result
- Construct an expected DataFrame representing the correct output
- Assert that the actual and expected DataFrames are equal
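The three-step pattern above can be sketched in plain Python without a SparkSession, using a hypothetical (schema, rows) tuple as a stand-in for a DataFrame. The names here (uppercase_name, test_uppercase_name) are illustrative, not part of any library:

```python
# Illustrative sketch only: a "DataFrame" is modeled as (schema, rows),
# where schema is a tuple of (column_name, type_name) pairs.

def uppercase_name(df):
    """The transformation under test: upper-case the 'name' column."""
    schema, rows = df
    name_idx = [name for name, _ in schema].index("name")
    new_rows = [
        tuple(v.upper() if i == name_idx else v for i, v in enumerate(row))
        for row in rows
    ]
    return (schema, new_rows)

def test_uppercase_name():
    # 1. Arrange: build the input DataFrame
    schema = (("id", "int"), ("name", "string"))
    input_df = (schema, [(1, "alice"), (2, "bob")])

    # 2. Act: apply the transformation to produce the actual result
    actual_df = uppercase_name(input_df)

    # 3. Assert: compare against a hand-constructed expected DataFrame
    expected_df = (schema, [(1, "ALICE"), (2, "BOB")])
    assert actual_df == expected_df

test_uppercase_name()
```

In real PySpark tests the same arrange/act/assert shape applies, with createDataFrame() building both the input and the expected DataFrame.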
DataFrame equality is more nuanced than scalar equality. Two DataFrames are considered equal when:
- They have the same schema — identical column names, data types, and (optionally) nullable flags
- They contain the same rows — identical values in every column for every row
- Row order may or may not matter depending on the assertion semantics
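A minimal equality check implementing these conditions can be sketched over the same kind of hypothetical (schema, rows) model, here with schema as (name, type, nullable) triples; the function names are illustrative:

```python
# Sketch of DataFrame equality over a hypothetical (schema, rows) model,
# where schema is a tuple of (column_name, type_name, nullable) triples.

def dataframes_equal(actual, expected, ignore_row_order=True):
    actual_schema, actual_rows = actual
    expected_schema, expected_rows = expected

    # Condition 1: identical schemas (names, types, nullable flags)
    if actual_schema != expected_schema:
        return False

    # Conditions 2 and 3: identical rows, optionally ignoring row order
    if ignore_row_order:
        return sorted(actual_rows) == sorted(expected_rows)
    return actual_rows == expected_rows
```

Sorting both row lists before comparing makes the check order-insensitive, mirroring assertion libraries that expose an ignore_row_order-style option.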
Comparing Actual vs Expected DataFrames
A robust DataFrame comparison must check multiple dimensions:
| Dimension | What is Compared | Failure Indicates |
|---|---|---|
| Column names | Names match in order | Schema mismatch |
| Column types | Data types match | Type inference or casting error |
| Nullable flags | Nullable attributes match | Nullable mismatch (often benign) |
| Row data | All row values match | Transformation logic error |
| Row count | Same number of rows | Missing or duplicate rows |
Simple equality checks (e.g., df1.collect() == df2.collect()) are fragile because they do not handle schema differences, type coercion, or nullable flags. Purpose-built assertion libraries provide structured comparison with clear error messages.
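The fragility is easy to demonstrate in plain Python: row values can compare equal even when their types differ, so a bare comparison of collected rows passes silently on a type mismatch that a schema-aware check would catch:

```python
# Python's == treats 1 and True as equal (bool is a subclass of int),
# so comparing row values alone cannot catch this type error.
actual_rows = [(1, "a")]       # an integer column
expected_rows = [(True, "a")]  # a boolean column

print(actual_rows == expected_rows)  # True: the type difference is invisible

# A schema-aware comparison (schemas modeled here as (name, type) pairs
# for illustration) does catch it:
actual_schema = (("flag", "int"),)
expected_schema = (("flag", "boolean"),)
print(actual_schema == expected_schema)  # False
```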
Handling Nullable Differences
One of the most common sources of false test failures in PySpark is nullable mismatches. When Spark infers a schema from Python data via createDataFrame(), it may set nullable flags differently than the schema produced by a transformation.
For example:
- A column created from a Python list of non-null integers may be inferred as nullable = true
- The same column produced by a transformation may have nullable = false
These differences are typically semantically irrelevant to the correctness of the transformation. A good assertion framework provides an ignore_nullable parameter that strips nullable flags from both schemas before comparison, allowing tests to focus on the data and types that actually matter.
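The idea behind an ignore_nullable option can be sketched in plain Python by modeling a schema as (name, type, nullable) triples and dropping the nullable flag from both sides before comparing. The helper names here (strip_nullable, schemas_equal) are hypothetical:

```python
# Sketch: strip nullable flags so schemas compare on names and types only.
# A schema is modeled as a tuple of (column_name, type_name, nullable) triples.

def strip_nullable(schema):
    return tuple((name, dtype) for name, dtype, _nullable in schema)

def schemas_equal(actual, expected, ignore_nullable=False):
    if ignore_nullable:
        return strip_nullable(actual) == strip_nullable(expected)
    return actual == expected

# Schemas that differ only in nullable flags:
inferred = (("id", "int", True), ("name", "string", True))
produced = (("id", "int", False), ("name", "string", True))

print(schemas_equal(inferred, produced))                        # False
print(schemas_equal(inferred, produced, ignore_nullable=True))  # True
```

With ignore_nullable enabled, the benign mismatch disappears while genuine name or type differences still fail the comparison.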
When to Apply
This principle applies when:
- Asserting PySpark transformation correctness in unit tests
- Comparing an actual DataFrame output against a hand-constructed expected DataFrame
- Nullable mismatches cause false failures that obscure real logic errors
- Clear, structured error messages are needed to diagnose test failures