DataExpert.io Data Engineer Handbook - Principle: Test Data Construction
Overview
The Test Data Construction principle addresses the theory of constructing test DataFrames from typed Python objects. Rather than reading external files or connecting to databases, PySpark unit tests build small, deterministic DataFrames inline using Python's collections.namedtuple and Spark's createDataFrame() method. This approach keeps tests self-contained, reproducible, and schema-enforced.
Theory of Constructing Test DataFrames from Typed Python Objects
Effective PySpark unit tests require both input DataFrames (fed into the transformation under test) and expected output DataFrames (compared against the actual result). These DataFrames should be:
- Small — only enough rows to exercise the logic under test
- Deterministic — identical across every test run
- Self-contained — no external dependencies (files, databases, APIs)
- Schema-enforced — column names and types are explicit, not inferred from ambiguous structures
Constructing DataFrames from typed Python objects satisfies all of these requirements.
Using Namedtuples for Schema Enforcement
Python's collections.namedtuple provides a lightweight mechanism for defining typed records with named fields:
- Each field in the namedtuple corresponds to a column in the resulting DataFrame
- The field names become column names
- The field values determine the inferred column types
- Misspelled or missing fields raise immediate Python errors, catching bugs before Spark is even involved
This gives test data a schema contract — each row conforms to an explicit structure, making tests more readable and less error-prone than using raw tuples or dictionaries.
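A minimal sketch of this schema contract, using only plain Python (no Spark required). The record name Visit and its fields are illustrative, not from the handbook:

```python
from collections import namedtuple

# Illustrative record type: field names will become column names.
Visit = namedtuple("Visit", ["user_id", "page", "duration_ms"])

# Well-formed rows construct cleanly and are readable at a glance.
row = Visit(user_id=1, page="home", duration_ms=250)
assert row.page == "home"
assert row._fields == ("user_id", "page", "duration_ms")

# A misspelled field fails immediately in plain Python,
# before Spark is ever involved.
try:
    Visit(user_id=2, pagee="about", duration_ms=100)
except TypeError as exc:
    print(f"caught schema violation: {exc}")
```

Note that a raw tuple like `(2, "about", 100)` would silently accept a swapped column order; the namedtuple makes that class of bug impossible.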
spark.createDataFrame() for Converting Python Collections
The spark.createDataFrame() method accepts a list of namedtuples (or other Row-like objects) and converts them into a Spark DataFrame. The conversion process:
- Infers the schema from the namedtuple field names and value types
- Creates an immutable, distributed DataFrame suitable for Spark operations
- Works entirely in local mode without any external data sources
Input and Expected Output DataFrames
Both sides of a test assertion are constructed using the same pattern:
- Input DataFrame — represents the data fed into the transformation under test
- Expected Output DataFrame — represents the correct result after transformation
By defining both in the same way (namedtuple + createDataFrame), tests maintain symmetry and clarity. The reader can easily compare input and expected output side by side.
When to Apply
This principle applies when:
- Writing unit tests for PySpark transformations
- Constructing small, deterministic DataFrames for test assertions
- Schema consistency between test data and production data matters
- Tests should be self-contained with no external data dependencies