
Principle:DataTalksClub Data engineering zoomcamp Dbt Testing And Documentation

From Leeroopedia


Page Metadata
Knowledge Sources: dbt data tests documentation, dbt documentation generation, analytics engineering best practices
Domains: Analytics Engineering, Data Quality, Data Testing, Documentation
Last Updated: 2026-02-09 14:00 GMT

Overview

Data quality testing and auto-generated documentation are integral parts of the analytics transformation pipeline, ensuring that data contracts are enforced and a searchable data catalog is maintained alongside the transformation code.

Description

The principle of testing and documentation as code holds that data quality assertions and metadata descriptions should live alongside the transformation logic, not in separate systems. This approach provides:

  • Schema tests: Declarative assertions on column properties (uniqueness, not-null, accepted values, referential integrity) that run as SQL queries against the built data.
  • Custom tests: Reusable test macros from packages (e.g., dbt_utils.unique_combination_of_columns) that validate complex constraints beyond single-column checks.
  • Model contracts: Enforced column-level type declarations that cause builds to fail if the actual output schema does not match the declared contract.
  • Auto-generated documentation: A static website generated from schema YAML files that provides a searchable data catalog with column descriptions, lineage graphs, and test results.

Testing and documentation together form a data contract between the transformation layer and its consumers. Tests validate that the contract is being honored at runtime, while documentation makes the contract discoverable and understandable.
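The declarations above are typically written in a schema YAML file that lives next to the model SQL. A minimal sketch follows; the model, column, and package names (`fct_trips`, `dim_zones`, `service_type`, `dbt_utils`) are illustrative, not prescribed by this page:

```yaml
# models/core/schema.yml -- illustrative sketch; model and column
# names are hypothetical.
version: 2

models:
  - name: fct_trips
    description: "One row per taxi trip."
    columns:
      - name: trip_id
        description: "Surrogate primary key."
        data_tests:
          - not_null
          - unique
      - name: service_type
        data_tests:
          - accepted_values:
              values: ['Green', 'Yellow']
      - name: pickup_location_id
        data_tests:
          - relationships:
              to: ref('dim_zones')
              field: location_id
    data_tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - pickup_zone
            - revenue_month
            - service_type
```

Note that column descriptions and test definitions sit in the same file, so the same YAML that enforces the contract also feeds the generated documentation.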

Usage

Use the testing and documentation pattern when:

  • Data quality guarantees must be enforced before downstream consumers access the data.
  • Primary key uniqueness and not-null constraints must be validated (analytical warehouses typically do not enforce these constraints).
  • Referential integrity between fact and dimension tables must be verified.
  • Enumerated values (e.g., service_type must be 'Green' or 'Yellow') must be constrained.
  • A browsable data catalog is needed for analysts, stakeholders, or new team members.
  • Compliance or audit requirements demand documented data lineage.

Theoretical Basis

Test Categories

Data tests in analytics engineering fall into distinct categories:

TEST TYPE            | VALIDATES                      | EXAMPLE
---------------------|--------------------------------|-----------------------------------------
not_null             | Column has no NULL values      | trip_id IS NOT NULL
unique               | Column has no duplicates       | COUNT(DISTINCT trip_id) = COUNT(*)
accepted_values      | Column values in allowed set   | service_type IN ('Green', 'Yellow')
relationships        | Foreign key references exist   | pickup_location_id EXISTS IN dim_zones
unique_combination   | Composite key is unique        | (zone, month, service_type) is unique
contract enforcement | Output schema matches declared | fct_trips columns match schema.yml types

Test Execution Model

Each test compiles to a SQL query that returns failing rows. A test passes if zero rows are returned:

function execute_test(test_definition, model):
    if test_definition.type == "not_null":
        query = "SELECT * FROM {model} WHERE {column} IS NULL"
    elif test_definition.type == "unique":
        query = "SELECT {column}, COUNT(*) FROM {model}
                 GROUP BY {column} HAVING COUNT(*) > 1"
    elif test_definition.type == "accepted_values":
        query = "SELECT * FROM {model}
                 WHERE {column} NOT IN ({allowed_values})"
    elif test_definition.type == "relationships":
        query = "SELECT * FROM {model}
                 WHERE {column} NOT IN (SELECT {field} FROM {to})"

    results = execute(query)
    if results.row_count == 0:
        return PASS
    else:
        return FAIL(results)
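The execution model above can be exercised end to end with an in-memory database. This is an illustrative sketch in plain Python with sqlite3, not dbt's actual test runner; the table and column names are invented for the example:

```python
import sqlite3

# Each generic test compiles to a query that returns failing rows;
# zero rows means PASS.
QUERIES = {
    "not_null": "SELECT * FROM {model} WHERE {column} IS NULL",
    "unique": (
        "SELECT {column}, COUNT(*) FROM {model} "
        "GROUP BY {column} HAVING COUNT(*) > 1"
    ),
    "accepted_values": (
        "SELECT * FROM {model} WHERE {column} NOT IN ({allowed})"
    ),
    "relationships": (
        "SELECT * FROM {model} WHERE {column} IS NOT NULL "
        "AND {column} NOT IN (SELECT {field} FROM {to})"
    ),
}

def execute_test(conn, test_type, **params):
    """Run one test; return ('PASS', []) or ('FAIL', failing_rows)."""
    query = QUERIES[test_type].format(**params)
    failing_rows = conn.execute(query).fetchall()
    return ("PASS", []) if not failing_rows else ("FAIL", failing_rows)

# Hypothetical model with a duplicated key and an invalid enum value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_trips (trip_id INT, service_type TEXT)")
conn.executemany("INSERT INTO fct_trips VALUES (?, ?)",
                 [(1, "Green"), (2, "Yellow"), (2, "Purple")])

status_unique, _ = execute_test(conn, "unique",
                                model="fct_trips", column="trip_id")
status_values, rows = execute_test(
    conn, "accepted_values", model="fct_trips",
    column="service_type", allowed="'Green', 'Yellow'")
print(status_unique, status_values)  # → FAIL FAIL
```

The `relationships` template excludes NULLs before the anti-join, mirroring the convention that a NULL foreign key is a `not_null` concern rather than a referential-integrity failure.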

Model Contract Enforcement

Model contracts extend beyond runtime tests to compile-time schema validation:

function enforce_contract(model, schema_definition):
    for each column in schema_definition.columns:
        if column not in model.output_columns:
            raise ERROR("Missing column: {column.name}")
        if column.data_type != model.output_columns[column.name].type:
            raise ERROR("Type mismatch: {column.name} expected {column.data_type}")

    -- At runtime, declared data_tests are also executed
    for each test in schema_definition.data_tests:
        execute_test(test, model)

Documentation Generation

Documentation is generated from two sources:

function generate_docs(project):
    -- Source 1: Schema YAML files (descriptions, column docs)
    for each schema_file in project.model_paths:
        extract model descriptions
        extract column descriptions
        extract test definitions

    -- Source 2: SQL model files (compiled queries, config blocks)
    for each model_file in project.model_paths:
        extract compiled SQL
        extract config (materialization, tags, etc.)

    -- Build lineage graph from ref() and source() calls
    lineage = build_dag(all_refs, all_sources)

    -- Output: static HTML site with search, lineage visualization, and column docs
    return render_catalog(descriptions, lineage, test_results)

The resulting catalog is a static website that can be served from any web server, providing a zero-infrastructure documentation solution.
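The lineage step above can be approximated by scanning model SQL for `ref()` and `source()` calls. This is a simplified sketch using regular expressions; the SQL snippets are invented examples, and dbt itself derives the graph from its parsed project manifest rather than raw text matching:

```python
import re

# Match {{ ref('model_name') }} and {{ source('src', 'table') }} calls.
REF_RE = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")
SOURCE_RE = re.compile(
    r"\{\{\s*source\(\s*['\"](\w+)['\"]\s*,\s*['\"](\w+)['\"]\s*\)\s*\}\}")

def build_dag(models):
    """models: dict of model name -> raw SQL. Returns model -> set(parents)."""
    dag = {}
    for name, sql in models.items():
        parents = set(REF_RE.findall(sql))
        parents |= {f"{s}.{t}" for s, t in SOURCE_RE.findall(sql)}
        dag[name] = parents
    return dag

# Hypothetical two-model project.
models = {
    "stg_trips": "SELECT * FROM {{ source('raw', 'trips') }}",
    "fct_trips": "SELECT * FROM {{ ref('stg_trips') }} "
                 "JOIN {{ ref('dim_zones') }} USING (location_id)",
}
print(build_dag(models))
```

Inverting this parent map gives the downstream view ("what breaks if this source changes?") that the catalog's lineage graph visualizes.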

Testing Pyramid for Data

Similar to the software testing pyramid, data testing has layers of increasing scope and cost:

                    /\
                   /  \   Contract tests (schema enforcement)
                  /    \
                 /------\
                /        \   Relationship tests (referential integrity)
               /          \
              /------------\
             /              \   Column tests (not_null, unique, accepted_values)
            /                \
           /------------------\
          /                    \   Source freshness checks
         /______________________\

The base of the pyramid (source freshness and column tests) should run on every build, while more expensive relationship and contract tests can run on a schedule or in CI.
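In dbt, the pyramid's base layer is configured declaratively in the sources YAML and checked with `dbt source freshness`. A sketch follows; the source, table, and column names are hypothetical:

```yaml
# models/staging/sources.yml -- illustrative sketch; names are hypothetical.
version: 2

sources:
  - name: raw_taxi
    loaded_at_field: loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: trips
```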
