Principle: Tobiko Data SQLMesh Plan Creation and Change Detection
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Deployment |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Plan creation and change detection is the process of comparing the current state of data transformation models against a target environment to identify what has changed, classify each change as breaking or non-breaking, and generate a deployment plan specifying which backfills are required to bring the environment up to date.
Description
Modern data transformation frameworks borrowed the "plan and apply" workflow from Infrastructure-as-Code tools such as Terraform: before making any changes to production systems, generate a preview showing exactly what will change and which operations will be executed. In data transformation, this is more complex than in infrastructure management because changes affect not just the structure of tables but the historical data they contain.
Plan creation involves computing fingerprints (content-based hashes) for each model based on its query, dependencies, and configuration. These fingerprints are compared against what's currently deployed in the target environment. When fingerprints differ, the framework must determine: Is this a breaking change (requiring reprocessing all historical data) or non-breaking (can safely start from now)? Should downstream models be reprocessed because upstream data changed? What time range of data needs backfilling?
The change detection algorithm categorizes changes as: direct changes (model query modified), indirect changes (upstream dependency changed), metadata changes (non-breaking modifications like descriptions), or forward-only changes (production deployments that only affect future data). Each category has different implications for what data needs reprocessing. The plan also handles restating models (reprocessing specific time ranges), handling gaps in incremental processing, and managing virtual environments where multiple development branches share physical data.
The result is a comprehensive plan that shows: which models changed and how, which models require backfilling and for what time ranges, what the impact is on downstream dependencies, and what database operations will be executed. Developers can review this plan, adjust backfill time ranges, and approve before any actual warehouse operations occur.
Usage
Plan creation is the mandatory first step of any deployment in a plan/apply workflow. It should be used when:
- deploying code changes to production
- creating isolated development environments for feature branches
- updating production with new data requirements
- investigating the impact of proposed changes before committing
- restating historical data after upstream corrections
- performing incremental deployments where only a subset of models is updated

The plan output serves as documentation of what changed and becomes an approval artifact in governance workflows. Development teams create plans frequently (many times per day), while production plans are created less often but reviewed with more scrutiny.
Theoretical Basis
The core logic for plan creation and change detection follows this algorithm:
Environment Analysis:
- Load current project state: all models with their queries and configurations
- Retrieve target environment state: snapshots (versioned models) currently deployed
- If target environment doesn't exist, determine which environment to create from
- Load model selection criteria if only deploying subset of models
Fingerprint Computation:
- For each model in the current state:
  - Compute a content hash of the model query (normalized SQL)
  - Include in the fingerprint: model kind, dependencies, dialect, configuration
  - Recursively incorporate upstream model fingerprints
  - Generate a unique snapshot identifier from the fingerprint
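The fingerprinting step above can be sketched in Python. This is a minimal illustration, not SQLMesh's actual implementation: the `Model` class, `normalize_sql` helper, and 12-character digest are assumptions for the example, and real fingerprints hash far more (rendered SQL, macros, dialect, full configuration).

```python
import hashlib
from dataclasses import dataclass

# Hypothetical, simplified model representation for illustration only.
@dataclass
class Model:
    name: str
    query: str                 # model SQL
    kind: str = "FULL"
    depends_on: tuple = ()     # upstream model names

def normalize_sql(sql: str) -> str:
    # Crude normalization: collapse whitespace, lowercase everything.
    return " ".join(sql.split()).lower()

def fingerprint(model: Model, models: dict) -> str:
    """Content hash that recursively folds in upstream fingerprints."""
    h = hashlib.sha256()
    h.update(normalize_sql(model.query).encode())
    h.update(model.kind.encode())
    for dep in sorted(model.depends_on):          # deterministic order
        h.update(fingerprint(models[dep], models).encode())
    return h.hexdigest()[:12]

models = {
    "raw.events": Model("raw.events", "SELECT * FROM src"),
    "mart.daily": Model(
        "mart.daily",
        "SELECT ds, COUNT(*) FROM raw.events GROUP BY ds",
        kind="INCREMENTAL_BY_TIME_RANGE",
        depends_on=("raw.events",),
    ),
}
fp_before = fingerprint(models["mart.daily"], models)
models["raw.events"].query = "SELECT * FROM src WHERE NOT is_bot"  # upstream edit
fp_after = fingerprint(models["mart.daily"], models)
assert fp_before != fp_after  # downstream fingerprint changes too
```

Because each fingerprint recursively incorporates its upstream fingerprints, editing one model automatically invalidates every downstream model, which is what drives indirect-change detection.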
Change Detection and Categorization:
- Compare current fingerprints against deployed snapshots
- Identify added models (new in current state, absent from environment)
- Identify removed models (present in environment, absent from current state)
- Identify modified models (fingerprint changed)
- For modified models, determine the change category:
  - Analyze the query diff to classify the change as breaking vs. non-breaking
  - Check whether only metadata changed (non-breaking by definition)
  - Determine whether downstream models need indirect updates
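The added/removed/modified comparison above reduces to set operations over name-to-fingerprint mappings. A minimal sketch, assuming fingerprints are opaque strings (the function name and dict shape are illustrative, not SQLMesh's API):

```python
def detect_changes(current: dict, deployed: dict) -> dict:
    """Compare name -> fingerprint maps for the current project vs. the
    target environment and bucket models into added/removed/modified."""
    added = sorted(current.keys() - deployed.keys())
    removed = sorted(deployed.keys() - current.keys())
    modified = sorted(
        name
        for name in current.keys() & deployed.keys()
        if current[name] != deployed[name]
    )
    return {"added": added, "removed": removed, "modified": modified}

changes = detect_changes(
    current={"a": "f1", "b": "f2", "c": "f3"},
    deployed={"a": "f1", "b": "old", "d": "f4"},
)
# -> {"added": ["c"], "removed": ["d"], "modified": ["b"]}
```

Classifying each modified model as breaking or non-breaking then requires inspecting the query diff itself; the set comparison only tells you *that* something changed, not *how*.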
Backfill Determination:
- For each changed model requiring data updates:
  - Determine the time range that needs processing based on model kind:
    - FULL models: typically reprocess the entire dataset
    - INCREMENTAL_BY_TIME_RANGE: process from the change date or configured start
    - SCD_TYPE_2: process from the effective date
  - Check for data gaps: missing intervals that should be filled
  - Apply user overrides: explicit start/end dates, skip backfill, restate requests
  - Calculate downstream impact: which dependent models need reprocessing
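Gap detection for an incremental model can be sketched as a scan over daily intervals. This assumes day-grained intervals and a simple set of already-processed dates; real frameworks track intervals at the model's configured grain:

```python
from datetime import date, timedelta

def missing_intervals(start: date, end: date, processed: set) -> list:
    """Return the daily intervals in [start, end] that have not been
    processed yet, i.e. the gaps a backfill must fill."""
    gaps, d = [], start
    while d <= end:
        if d not in processed:
            gaps.append(d)
        d += timedelta(days=1)
    return gaps

processed = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}
gaps = missing_intervals(date(2024, 1, 1), date(2024, 1, 5), processed)
# -> [date(2024, 1, 3), date(2024, 1, 5)]
```

User overrides (explicit start/end dates, restatements) then narrow or extend this computed range before the plan is assembled.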
Forward-Only Change Handling:
- For production environments with the forward-only setting:
  - Restrict changes to only affect future data intervals
  - Validate that breaking changes are explicitly allowed
  - Set the effective date from which the new logic applies
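The interval restriction for forward-only deployments amounts to clamping the backfill to intervals on or after the effective date. A minimal sketch under the same assumptions as above (intervals as `(start, end)` date pairs; the function name is illustrative):

```python
from datetime import date

def restrict_forward_only(intervals: list, effective_date: date) -> list:
    """Drop intervals that start before the effective date, so existing
    historical data is left untouched and only future data is affected."""
    return [(s, e) for (s, e) in intervals if s >= effective_date]

intervals = [
    (date(2024, 1, 1), date(2024, 1, 31)),   # historical: preserved as-is
    (date(2024, 2, 1), date(2024, 2, 29)),   # future: new logic applies
]
future_only = restrict_forward_only(intervals, date(2024, 2, 1))
# -> [(date(2024, 2, 1), date(2024, 2, 29))]
```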
Plan Assembly:
- Construct the plan object containing:
  - Context diff: detailed before/after comparison for each changed model
  - Backfill requirements: models and time ranges to process
  - Promotion steps: which snapshots to promote to the environment
  - Validation results: test outcomes, linter findings
- Categorize uncategorized changes if auto-categorization is enabled
- Calculate plan statistics: total backfill cost, time estimates
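The assembled plan is essentially a structured record of the outputs of the previous steps. A minimal sketch of such a container (field names are illustrative assumptions, not SQLMesh's actual `Plan` class):

```python
from dataclasses import dataclass

@dataclass
class PlanSummary:
    """Illustrative plan object aggregating the change-detection and
    backfill-determination results for review before apply."""
    added: list
    removed: list
    directly_modified: list
    indirectly_modified: list
    backfills: dict  # model name -> (start, end) time range

    def stats(self) -> dict:
        # Headline numbers a reviewer sees before approving the plan.
        return {
            "models_changed": len(self.directly_modified)
            + len(self.indirectly_modified),
            "models_to_backfill": len(self.backfills),
        }

plan = PlanSummary(
    added=["mart.new"],
    removed=[],
    directly_modified=["raw.events"],
    indirectly_modified=["mart.daily"],
    backfills={"mart.daily": ("2024-01-01", "2024-01-31")},
)
# plan.stats() -> {"models_changed": 2, "models_to_backfill": 1}
```

Keeping the plan as plain data is what makes the review step possible: it can be rendered for a human, diffed, logged as an approval artifact, and only then translated into warehouse operations.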
User Interaction and Refinement:
- Display plan summary: changes categorized by type and impact
- Show SQL diffs for modified models
- Present backfill time ranges for user adjustment
- Collect user input for uncategorized changes
- Allow interactive modification of start/end dates
The algorithm balances safety (detecting all relevant changes) with performance (avoiding unnecessary reprocessing). It must handle complex scenarios like circular references resolved through virtual views, models that share physical storage, and changes that affect environment-specific configurations.