DataExpert.io Data Engineer Handbook - Principle: Test Data Construction
Overview
The Test Data Construction principle addresses the theory of constructing test DataFrames from typed Python objects. Rather than reading external files or connecting to databases, PySpark unit tests build small, deterministic DataFrames inline using Python's collections.namedtuple and Spark's createDataFrame() method. This approach keeps tests self-contained, reproducible, and schema-enforced.
Theory of Constructing Test DataFrames from Typed Python Objects
Effective PySpark unit tests require both input DataFrames (fed into the transformation under test) and expected output DataFrames (compared against the actual result). These DataFrames should be:
- Small — only enough rows to exercise the logic under test
- Deterministic — identical across every test run
- Self-contained — no external dependencies (files, databases, APIs)
- Schema-enforced — column names and types are explicit, not inferred from ambiguous structures
Constructing DataFrames from typed Python objects satisfies all of these requirements.
Using Namedtuples for Schema Enforcement
Python's collections.namedtuple provides a lightweight mechanism for defining typed records with named fields:
- Each field in the namedtuple corresponds to a column in the resulting DataFrame
- The field names become column names
- The field values determine the inferred column types
- Misspelled or missing fields raise immediate Python errors, catching bugs before Spark is even involved
This gives test data a schema contract — each row conforms to an explicit structure, making tests more readable and less error-prone than using raw tuples or dictionaries.
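A minimal sketch of this schema contract, using only plain Python (no Spark required). The record name Visit and its fields are illustrative, not from the handbook:

```python
from collections import namedtuple

# Illustrative record type: field names will become column names.
Visit = namedtuple("Visit", ["user_id", "page", "duration_ms"])

# Well-formed rows construct cleanly and are readable at a glance.
row = Visit(user_id=1, page="home", duration_ms=250)
assert row.page == "home"
assert row._fields == ("user_id", "page", "duration_ms")

# A misspelled field fails immediately in plain Python,
# before Spark is ever involved.
try:
    Visit(user_id=2, pagee="about", duration_ms=100)
except TypeError as exc:
    print(f"caught schema violation: {exc}")
```

Note that a raw tuple like `(2, "about", 100)` would silently accept a swapped column order; the namedtuple makes that class of bug impossible.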
spark.createDataFrame() for Converting Python Collections
The spark.createDataFrame() method accepts a list of namedtuples (or other Row-like objects) and converts them into a Spark DataFrame. The conversion process:
- Infers the schema from the namedtuple field names and value types
- Creates an immutable, distributed DataFrame suitable for Spark operations
- Works entirely in local mode without any external data sources
Input and Expected Output DataFrames
Both sides of a test assertion are constructed using the same pattern:
- Input DataFrame — represents the data fed into the transformation under test
- Expected Output DataFrame — represents the correct result after transformation
By defining both in the same way (namedtuple + createDataFrame), tests maintain symmetry and clarity. The reader can easily compare input and expected output side by side.
When to Apply
This principle applies when:
- Writing unit tests for PySpark transformations
- Constructing small, deterministic DataFrames for test assertions
- Schema consistency between test data and production data matters
- Tests should be self-contained with no external data dependencies