
Principle:TobikoData Sqlmesh Backfill Strategy Selection

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Deployment
Last Updated 2026-02-07 00:00 GMT

Overview

Backfill strategy selection is the process of determining the optimal time range for reprocessing historical data when deploying changes to data transformation models, balancing correctness requirements against computational cost and processing time.

Description

When data transformation models change, a critical question emerges: how much historical data needs to be reprocessed? Reprocessing everything ensures correctness but may be prohibitively expensive—imagine reprocessing years of daily aggregations for a minor calculation fix. Processing nothing is fast but risks serving incorrect results for historical queries. The backfill strategy determines the time window that strikes the appropriate balance.

Different scenarios demand different strategies. A bug fix affecting calculations should ideally reprocess all affected historical data. A new column can often start processing only from now forward. A change to handle a new edge case might only need to reprocess data from when that case first appeared. Incremental models (processing data in time-based chunks) require careful consideration of interval boundaries—processing must align with the model's grain (hourly, daily, monthly) and account for late-arriving data.

The framework must support various backfill patterns: full backfill (reprocess from earliest available date to present), partial backfill (reprocess specific date range), forward-only (no historical reprocessing, changes apply to future data only), and selective backfill (reprocess only models matching certain criteria). Cost considerations are paramount—a full backfill of years of data across hundreds of models could run for hours and cost thousands of dollars in warehouse compute.

Sophisticated frameworks allow dynamic adjustment of backfill ranges per model based on change type, provide cost estimation before execution, support pausing and resuming long-running backfills, and enable "what-if" analysis to compare different strategies. The strategy also interacts with data gaps—should missing historical intervals be filled, or should only new intervals be processed?

Usage

Backfill strategy selection should occur during the plan review phase, after changes are detected but before execution begins. Use full backfills when correctness is paramount (bug fixes, regulatory compliance changes). Use partial backfills when changes only affect recent data or when full reprocessing is cost-prohibitive. Use forward-only when deploying to production with changes that shouldn't alter historical reporting.

Adjust strategies based on:

  1. Model importance: critical reporting models warrant full backfills
  2. Data volume: large tables may justify shorter ranges
  3. Downstream impact: changes affecting many dependencies need careful scoping
  4. Business requirements: month-end close may require complete accuracy for the quarter

Teams often start with automated defaults and then manually refine ranges for specific models.
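As a rough sketch, these usage heuristics can be expressed as a small decision function. All names here (`ChangeContext`, `choose_strategy`) are illustrative, not part of SQLMesh or any framework's API:

```python
from dataclasses import dataclass

@dataclass
class ChangeContext:
    is_bug_fix: bool         # the change corrects historical calculations
    affects_history: bool    # historical rows would differ after the change
    cost_prohibitive: bool   # full reprocessing exceeds the time/cost budget
    protect_history: bool    # historical reporting must not change

def choose_strategy(ctx: ChangeContext) -> str:
    """Pick a backfill strategy following the usage heuristics above."""
    if ctx.protect_history:
        return "forward-only"    # production deploys that must not alter history
    if ctx.is_bug_fix and ctx.affects_history and not ctx.cost_prohibitive:
        return "full"            # correctness is paramount
    if ctx.affects_history:
        return "partial"         # reprocess only the affected range
    return "forward-only"        # additive changes such as new columns

# A calculation bug fix on an affordable model warrants a full backfill:
print(choose_strategy(ChangeContext(True, True, False, False)))  # full
```

A real selector would also weigh model importance and downstream impact; this sketch keeps only the four binary signals for clarity.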

Theoretical Basis

The core logic for backfill strategy selection follows this algorithm:

Context Assessment:

  1. Identify model change type: breaking, non-breaking, metadata-only
  2. Determine model kind: FULL, INCREMENTAL_BY_TIME_RANGE, SCD_TYPE_2, VIEW
  3. Analyze environment type: production vs. development
  4. Check for explicit user-provided start/end dates
  5. Review forward-only flags and deployment policies

Default Range Calculation:

  1. For each modified model requiring backfill:
    1. If FULL model: typically entire dataset (no time range applicable)
    2. If INCREMENTAL_BY_TIME_RANGE model:
      1. Start: earliest of (model creation date, change effective date, user-specified start)
      2. End: latest of (current date, user-specified end)
      3. Consider model's lookback period (how far back queries reference)
    3. If VIEW or external model: no backfill needed (just metadata update)
    4. If SCD_TYPE_2 model: determine effective date range for slowly changing dimensions

Gap Analysis:

  1. Query state backend for existing intervals already processed
  2. Identify gaps: time periods within desired range that lack data
  3. Determine if gaps should be filled based on:
    1. no_gaps flag: enforce complete interval coverage
    2. Gap size: small gaps may auto-fill, large gaps may require explicit approval
    3. Model configuration: some models intentionally have gaps
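Gap detection at a daily grain can be sketched as a scan over the desired range; `find_gaps` is a hypothetical helper, not a real framework API:

```python
from datetime import date, timedelta

def find_gaps(processed: set[date], start: date, end: date) -> list[date]:
    """Return the daily intervals in [start, end) missing from `processed`.

    `processed` holds the dates whose daily interval the state backend
    already records as complete.
    """
    gaps, d = [], start
    while d < end:
        if d not in processed:
            gaps.append(d)
        d += timedelta(days=1)
    return gaps

done = {date(2024, 1, 1), date(2024, 1, 3)}
print(find_gaps(done, date(2024, 1, 1), date(2024, 1, 4)))
# [datetime.date(2024, 1, 2)]
```

With a `no_gaps` policy, a non-empty result would block the plan; otherwise the gap list feeds the size-based auto-fill decision.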

Cost and Time Estimation:

  1. For incremental models: count intervals to process (hours, days, months)
  2. Estimate compute time per interval based on model history
  3. Calculate total expected duration and cost
  4. Flag expensive backfills for user review
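A back-of-the-envelope estimator following these steps; the names and the review threshold are assumptions for illustration:

```python
from datetime import date

def estimate_backfill(start: date, end: date, seconds_per_interval: float,
                      cost_per_second: float,
                      review_threshold: float = 100.0) -> dict:
    """Estimate duration and cost for daily intervals; flag expensive runs."""
    intervals = (end - start).days               # daily grain assumed
    seconds = intervals * seconds_per_interval   # from historical run times
    cost = seconds * cost_per_second
    return {"intervals": intervals, "seconds": seconds,
            "cost": cost, "needs_review": cost > review_threshold}

# One year of daily intervals at 30s each and $0.05 per compute-second:
est = estimate_backfill(date(2023, 1, 1), date(2024, 1, 1), 30.0, 0.05)
print(est["intervals"], est["needs_review"])  # 365 True
```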

User Override Integration:

  1. Apply explicit start date override: limit how far back processing goes
  2. Apply explicit end date override: control forward boundary
  3. Process skip_backfill flag: skip the backfill step entirely, applying the change without processing any intervals
  4. Process empty_backfill flag: record intervals as complete without executing them, so future runs treat them as processed
  5. Handle restate_models directive: force reprocessing even if data exists

Forward-Only Adjustments:

  1. If forward-only mode enabled:
    1. Set start to effective_from date (no historical reprocessing)
    2. Validate that changes are permitted in forward-only (non-breaking or approved)
    3. Adjust downstream models to also start from effective date
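A minimal sketch of the forward-only adjustment: validate the change type, then start every affected model at the effective date (names are hypothetical):

```python
from datetime import date

def forward_only_ranges(models: list[str], change_type: str,
                        effective_from: date,
                        today: date) -> dict[str, tuple[date, date]]:
    """Start every model at effective_from so no history is rewritten."""
    if change_type == "breaking":
        raise ValueError("breaking changes need approval for forward-only mode")
    # Downstream models inherit the same start as the changed model.
    return {name: (effective_from, today) for name in models}
```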

Per-Model Range Finalization:

  1. For each model in deployment:
    1. Calculate final [start, end] interval
    2. Align boundaries to model's time grain (snap to day/hour/month boundaries)
    3. Validate range feasibility: end >= start, range within data availability
    4. Generate list of discrete intervals to process
    5. Store range in plan object for execution phase
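Boundary snapping and interval enumeration might look like the following sketch (hypothetical helper, daily grain assumed):

```python
from datetime import datetime, timedelta

def finalize_daily_range(start: datetime,
                         end: datetime) -> list[tuple[datetime, datetime]]:
    """Snap [start, end] outward to day boundaries, then list day intervals."""
    lo = start.replace(hour=0, minute=0, second=0, microsecond=0)
    hi = end.replace(hour=0, minute=0, second=0, microsecond=0)
    if hi < end:                 # end fell mid-day: round up to the next day
        hi += timedelta(days=1)
    assert hi >= lo, "end must not precede start"
    return [(lo + timedelta(days=i), lo + timedelta(days=i + 1))
            for i in range((hi - lo).days)]

# 06:00 on Jan 1 through midnight on Jan 3 snaps to two full-day intervals:
print(len(finalize_daily_range(datetime(2024, 1, 1, 6), datetime(2024, 1, 3))))
# 2
```

Snapping outward keeps the range safe (no partial day is silently dropped); a production implementation would also handle timezones and hourly or monthly grains.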

Interactive Refinement:

  1. Present calculated ranges to user in plan output
  2. Provide interface to adjust start/end per model
  3. Recalculate estimates when user modifies ranges
  4. Validate modified ranges don't violate constraints
  5. Update plan with final approved ranges

The algorithm prioritizes safety by defaulting to broader time ranges, but provides clear controls for users to optimize based on their specific context. It must handle edge cases like models that reference unbounded historical data, timezone considerations for interval boundaries, and coordinating ranges across dependent models (downstream models must cover at least the range of upstream changes).

Related Pages

Implemented By
