Principle:TobikoData Sqlmesh Production Deployment With Backfill

Knowledge Sources	SQLMesh SQLMesh Docs
Domains	Data_Engineering, Incremental_Processing
Last Updated	2026-02-07 00:00 GMT

Overview

Deploy incremental model changes to production by computing and executing backfills for all missing or invalidated time intervals.

Description

Production deployment with backfill orchestrates the process of applying model changes to production environments by identifying which time intervals need processing and executing them in dependency order. Unlike simple deployments that only affect future data, backfill ensures that historical data gaps are filled and that changes propagate through the entire dependency graph.

The backfill process analyzes the current state of each model to determine which intervals have been completed, which are missing, and which need reprocessing due to upstream changes or restatement requests. It then generates an execution plan that respects model dependencies, cron schedules, and resource constraints while ensuring data consistency.
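The comparison between expected and completed intervals can be illustrated with a minimal Python sketch. It assumes a daily cadence and half-open (start, end) date pairs; the function names expected_daily_intervals and missing_intervals are hypothetical and are not part of the SQLMesh API.

```python
from datetime import date, timedelta


def expected_daily_intervals(start: date, end: date) -> list[tuple[date, date]]:
    """Return half-open daily intervals [d, d + 1 day) covering [start, end)."""
    intervals = []
    current = start
    while current < end:
        nxt = current + timedelta(days=1)
        intervals.append((current, nxt))
        current = nxt
    return intervals


def missing_intervals(expected, completed):
    """Return the expected intervals not yet recorded as completed, in order."""
    done = set(completed)
    return [iv for iv in expected if iv not in done]


# Illustrative data: four expected days, two of which are already complete.
expected = expected_daily_intervals(date(2024, 1, 1), date(2024, 1, 5))
completed = [expected[0], expected[2]]
missing = missing_intervals(expected, completed)
```

Here missing holds the intervals for January 2 and January 4, which are the only ones a backfill would need to process.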

Backfills handle several scenarios: initial deployments where no historical data exists, incremental deployments where recent intervals are missing, and restatements where previously processed intervals must be recalculated due to corrections in source data or logic changes.

The system provides several safety mechanisms: a preview that estimates backfill cost before anything runs, the ability to skip backfill for forward-only changes, and an empty backfill mode that records intervals as complete without processing them.

Usage

Use production deployment with backfill when rolling out new incremental models or deploying changes that require historical data processing. Specify the start and end dates to control the backfill window, limiting processing to recent intervals for performance or expanding to full history for comprehensive corrections.

Apply restatement parameters when upstream data sources have been corrected and downstream models need to reflect those corrections. The system automatically identifies all affected intervals and downstream dependencies that require reprocessing.
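One way to picture how a restatement reaches downstream models is a breadth-first walk over the dependency graph. The sketch below is an assumption about the general technique, not the SQLMesh implementation; the function downstream_closure and the model names are hypothetical.

```python
from collections import deque


def downstream_closure(children: dict[str, list[str]], restated: set[str]) -> set[str]:
    """Return the restated models plus every model reachable downstream of them."""
    affected = set(restated)
    queue = deque(restated)
    while queue:
        model = queue.popleft()
        for child in children.get(model, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical DAG: raw_orders -> stg_orders -> daily_revenue; raw_users -> stg_users.
children = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["daily_revenue"],
    "raw_users": ["stg_users"],
}
affected = downstream_closure(children, {"stg_orders"})
```

Restating stg_orders marks daily_revenue for reprocessing as well, while the unrelated raw_users branch is left alone.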

Use the end_bounded flag to prevent backfills from extending beyond the specified end date because of lookback windows or partial interval handling. This keeps resource consumption predictable.
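The effect of the end bound can be sketched as a simple filter over candidate intervals, mirroring the end_bounded step in the planning pseudocode on this page. The helper apply_end_bound is hypothetical, and the half-open (start, end) representation is an assumption.

```python
from datetime import date


def apply_end_bound(intervals, end, end_bounded):
    """Drop intervals that finish after `end` when the bound is enabled."""
    if not end_bounded:
        return list(intervals)
    return [(s, e) for (s, e) in intervals if e <= end]


# Illustrative candidates: the third interval extends past the requested end date.
candidates = [
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 2), date(2024, 1, 3)),
    (date(2024, 1, 3), date(2024, 1, 4)),
]
bounded = apply_end_bound(candidates, date(2024, 1, 3), end_bounded=True)
unbounded = apply_end_bound(candidates, date(2024, 1, 3), end_bounded=False)
```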

Configure selected_snapshots to limit backfill to specific models when deploying changes that affect only a subset of the dependency graph.

Theoretical Basis

Production deployment with backfill implements a multi-phase orchestration process:

BACKFILL_PLANNING:
  target_env = "prod"
  deployment_time = now()

  current_state = load_interval_state(target_env)
  model_versions = load_model_snapshots(target_env)

  FOR each model in deployment:
    model_start = coalesce(user_start, model.start_date, earliest_data)
    model_end = coalesce(user_end, deployment_time)

    expected_intervals = generate_intervals(
      model.cron,
      model_start,
      model_end,
      respect_cron = NOT ignore_cron
    )

    completed_intervals = current_state.get_completed(model)

    missing = expected_intervals - completed_intervals

    IF restatements contains model THEN
      restate_range = restatements[model]
      invalidated = completed_intervals ∩ restate_range
      missing = missing ∪ invalidated

      IF restate_all_snapshots THEN
        clear_intervals(all_versions(model), restate_range)

    IF end_bounded THEN
      missing = filter(missing, interval.end <= model_end)

    backfill_plan[model] = missing

DEPENDENCY_RESOLUTION:
  dag = build_dependency_graph(models)
  sorted_models = topological_sort(dag)

  FOR each model in sorted_models:
    FOR each interval in backfill_plan[model]:
      upstream_intervals = compute_required_upstream(
        model,
        interval,
        lookback=model.lookback
      )

      FOR each (upstream_model, upstream_interval) in upstream_intervals:
        IF upstream_interval NOT completed THEN
          add_dependency(interval, upstream_interval)

EXECUTION_ORCHESTRATION:
  ready_queue = intervals with no pending dependencies

  WHILE ready_queue not empty:
    # select_next_batch removes the chosen intervals from ready_queue
    batch = select_next_batch(ready_queue, batch_size, batch_concurrency)

    results = parallel_execute(batch):
      FOR each interval in batch:
        model = interval.model
        input_data = read_from_dependencies(interval)
        output_data = apply_transformation(input_data)

        physical_table = resolve_table_name(model, interval)

        IF interval.is_first_for_model THEN
          CREATE_OR_REPLACE_TABLE physical_table
        ELSE
          INSERT_OR_REPLACE INTO physical_table

        mark_completed(model, interval, deployment_time)

    FOR each completed_interval in results:
      downstream = find_dependent_intervals(completed_interval)
      ready_queue.add(filter_ready(downstream))

COMPLETION:
  validate_all_intervals_processed()
  update_environment_snapshot_versions()
  finalize_environment(target_env, deployment_time)
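The dependency resolution and execution phases above can be approximated in Python with the standard library's graphlib module. This is a sequential sketch under stated assumptions: each node is a hypothetical (model, interval) pair, the upstream mapping is supplied by the caller, and a real scheduler would dispatch each ready group in parallel rather than one node at a time.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(upstream, process):
    """Process nodes so that each runs only after all of its upstream nodes.

    `upstream` maps a node to the set of nodes it depends on.
    Returns the order in which nodes were processed.
    """
    sorter = TopologicalSorter(upstream)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # Every node returned here has no pending dependencies.
        for node in sorter.get_ready():
            process(node)
            order.append(node)
            sorter.done(node)  # unblocks nodes that depend on this one
    return order


# Hypothetical plan: daily_revenue depends on stg_orders for the same day.
upstream = {
    ("stg_orders", "2024-01-01"): set(),
    ("stg_orders", "2024-01-02"): set(),
    ("daily_revenue", "2024-01-01"): {("stg_orders", "2024-01-01")},
    ("daily_revenue", "2024-01-02"): {("stg_orders", "2024-01-02")},
}
processed = []
order = run_in_dependency_order(upstream, processed.append)
```

Marking a node done only after process returns is what keeps a downstream interval from starting before its inputs exist.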

Critical aspects of backfill execution:

Gap Detection: Identifies missing intervals by comparing expected intervals (based on cron schedule) against completion state.

Restatement Propagation: When an interval is restated, automatically identifies all downstream models whose outputs depend on the restated data.

Batch Optimization: Groups consecutive intervals into batches to reduce overhead while respecting concurrency limits.

Atomic State Updates: Interval completion is recorded transactionally to prevent partial state in case of failures.

Idempotency: Rerunning a backfill safely processes only remaining incomplete intervals.
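The batching idea can be illustrated by coalescing adjacent intervals into larger ranges before execution. The sketch below assumes half-open (start, end) date pairs and uses a hypothetical helper, merge_consecutive; it shows the general technique, not the SQLMesh scheduler's code.

```python
from datetime import date


def merge_consecutive(intervals):
    """Coalesce intervals whose end equals the next interval's start."""
    merged = []
    for start, end in sorted(intervals):
        if merged and merged[-1][1] == start:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Illustrative gaps: January 1-3 is contiguous, January 4-5 is separate.
gaps = [
    (date(2024, 1, 2), date(2024, 1, 3)),
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 4), date(2024, 1, 5)),
]
batches = merge_consecutive(gaps)
```

Three missing daily intervals collapse into two ranges, so the first two days can be processed in one pass.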

The system ensures that production data remains consistent throughout the backfill by updating the environment pointer only after all intervals complete successfully.
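A minimal sketch of this promote-on-success behavior follows. It assumes the environment pointer is a plain dictionary entry and that a failed interval raises an exception; deploy and its parameters are hypothetical names used only for illustration.

```python
def deploy(pointers, env, new_version, intervals, process):
    """Run every interval, then repoint `env` to `new_version`.

    If `process` raises for any interval, the exception propagates and the
    pointer keeps its previous value.
    """
    for interval in intervals:
        process(interval)
    pointers[env] = new_version  # reached only if every interval succeeded


# Illustrative run: both intervals succeed, so prod moves from v1 to v2.
pointers = {"prod": "v1"}
handled = []
deploy(pointers, "prod", "v2", ["2024-01-01", "2024-01-02"], handled.append)
```

Because the pointer is assigned last, readers of the environment see either the old version or the fully backfilled new one.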

Related Pages

Implemented By

Implementation:TobikoData_Sqlmesh_Scheduler_Merged_Missing_Intervals
