Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:TobikoData Sqlmesh Unit Test Creation

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Model_Development, Testing
Last Updated 2026-02-07 00:00 GMT

Overview

An automated approach to generating test fixtures by capturing real data samples and expected outputs, enabling regression testing and validation of data transformation logic.

Description

Unit testing SQL transformations presents unique challenges compared to traditional software testing. SQL models often have complex dependencies on upstream tables, use database-specific functions, and produce large result sets. Writing test fixtures manually requires creating sample input data for all dependencies and defining expected outputs—a tedious and error-prone process that discourages comprehensive testing.

Automated test generation solves this by leveraging existing data in development or staging environments. Engineers specify queries that retrieve representative samples from upstream dependencies, and the framework executes the model against this real data to capture the actual output. This output becomes the expected result for future test runs. The generated test fixture is self-contained, including all input data and expected outputs in a format that can run quickly without database access.

This approach dramatically lowers the barrier to testing. Instead of manually crafting CSV files or SQL insert statements, engineers simply identify appropriate source data and generate fixtures automatically. The tests serve multiple purposes: regression detection when refactoring, validation during code review, fast feedback during development, and documentation showing concrete examples of model behavior with real data.

Usage

Use automated test generation when developing new models, refactoring existing transformations, fixing bugs, or documenting expected model behavior. Generate tests after validating that a model produces correct results in a development environment, then use those tests for fast iteration and regression detection in CI/CD pipelines.

Theoretical Basis

The test generation algorithm follows a capture-and-replay pattern:

FUNCTION generate_test(model, input_queries, test_name):
    # Execute input queries to capture upstream data
    input_fixtures = {}
    FOR EACH dependency, query IN input_queries:
        result = execute_query(query, production_engine)
        input_fixtures[dependency] = serialize_to_fixture(result)
    END FOR

    # Optionally capture CTE data for better debugging
    IF include_ctes THEN
        cte_fixtures = capture_intermediate_ctes(model, input_fixtures)
        input_fixtures = merge(input_fixtures, cte_fixtures)
    END IF

    # Execute model with captured inputs
    test_engine = create_in_memory_engine()
    load_fixtures_into_test_engine(input_fixtures, test_engine)

    # Capture model output as expected result
    output = execute_model(model, test_engine)
    expected_fixture = serialize_to_fixture(output)

    # Generate test file in YAML format
    test_definition = {
        name: test_name,
        model: model.name,
        inputs: input_fixtures,
        outputs: {
            query: expected_fixture
        }
    }

    # Write to project test directory
    test_path = resolve_test_path(project_path, model, test_name)
    write_yaml(test_path, test_definition)

    RETURN test_path
END FUNCTION

The key insight is separating test execution into two phases: generation (capture) and validation (replay). During generation, real data flows through the model to establish ground truth. During validation, the same inputs are replayed against the model to detect any deviations from expected behavior.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment