

Principle:DataExpert io Data engineer handbook Test Data Construction

From Leeroopedia


Overview

The Test Data Construction principle covers how to construct test DataFrames from typed Python objects. Rather than reading external files or connecting to databases, PySpark unit tests build small, deterministic DataFrames in-line using Python's namedtuple and Spark's createDataFrame() method. This approach ensures tests are self-contained, reproducible, and schema-enforced.

Theory of Constructing Test DataFrames from Typed Python Objects

Effective PySpark unit tests require both input DataFrames (fed into the transformation under test) and expected output DataFrames (compared against the actual result). These DataFrames should be:

  • Small — only enough rows to exercise the logic under test
  • Deterministic — identical across every test run
  • Self-contained — no external dependencies (files, databases, APIs)
  • Schema-enforced — column names and types are explicit, not inferred from ambiguous structures

Constructing DataFrames from typed Python objects satisfies all of these requirements.

Using Namedtuples for Schema Enforcement

Python's collections.namedtuple provides a lightweight mechanism for defining typed records with named fields:

  • Each field in the namedtuple corresponds to a column in the resulting DataFrame
  • The field names become column names
  • The field values determine the inferred column types
  • Misspelled or missing fields raise immediate Python errors, catching bugs before Spark is even involved

This gives test data a schema contract — each row conforms to an explicit structure, making tests more readable and less error-prone than using raw tuples or dictionaries.
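A minimal sketch of this schema contract in plain Python (the record name and fields here are illustrative, not taken from the handbook):

```python
from collections import namedtuple

# Each field becomes a column in the eventual DataFrame;
# field names become column names.
Actor = namedtuple("Actor", ["actor_id", "name", "quality_class"])

rows = [
    Actor(actor_id=1, name="Meryl Streep", quality_class="star"),
    Actor(actor_id=2, name="Nicolas Cage", quality_class="good"),
]

# A missing or misspelled field fails immediately with a plain
# Python TypeError, before Spark is ever involved.
try:
    Actor(actor_id=3, name="Tom Hanks")  # quality_class omitted
except TypeError as err:
    print(f"caught schema error: {err}")
```

Compare this with raw tuples, where a dropped or reordered value silently shifts data into the wrong column.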

spark.createDataFrame() for Converting Python Collections

The spark.createDataFrame() method accepts a list of namedtuples (or other Row-like objects) and converts them into a Spark DataFrame. The conversion process:

  • Infers the schema from the namedtuple field names and value types
  • Creates an immutable, distributed DataFrame suitable for Spark operations
  • Works entirely in local mode without any external data sources

Input and Expected Output DataFrames

Both sides of a test assertion are constructed using the same pattern:

  • Input DataFrame — represents the data fed into the transformation under test
  • Expected Output DataFrame — represents the correct result after transformation

By defining both in the same way (namedtuple + createDataFrame), tests maintain symmetry and clarity. The reader can easily compare input and expected output side by side.

When to Apply

This principle applies when:

  • Writing unit tests for PySpark transformations
  • Constructing small, deterministic DataFrames for test assertions
  • Enforcing schema consistency between test data and production data
  • Keeping tests self-contained, with no external data dependencies

