Principle:Duckdb Duckdb Performance Regression Detection

Overview

Performance Regression Detection is the principle of detecting performance degradation by comparing benchmark timings across versions. When code changes are introduced, the same set of benchmarks is run on both the old (baseline) and new (candidate) versions, and the resulting timing distributions are compared to determine whether any benchmark has regressed beyond an acceptable threshold. This provides an automated safety net against unintentional performance degradation.

Description

The core idea behind Performance Regression Detection is statistical timing comparison between two sets of benchmark results. Rather than relying on a single run, benchmarks are executed multiple times to produce a distribution of timings, and a representative statistic (typically the median) is used for comparison.

Key aspects of this principle include:

Statistical timing comparison -- Benchmarks produce multiple timing samples. Comparing single runs would be unreliable due to system noise, so the comparison uses aggregate statistics (median, mean, or percentiles) to reduce the impact of outliers and measurement variance.
Regression thresholds -- A regression is declared when the new timing exceeds the old timing by more than a defined threshold. Thresholds are typically expressed as both a relative percentage (e.g., 10% slower) and an absolute minimum difference (e.g., >0.01 seconds) to avoid false positives on very fast benchmarks where small absolute differences can appear as large percentage changes.
Median-based comparison -- Using the median rather than the mean provides robustness against outlier runs (e.g., a single run affected by a garbage collection pause or OS scheduling anomaly). The median is the preferred central tendency measure for benchmark timing distributions.
Dual-threshold gating -- Both a relative threshold (percentage increase) and an absolute threshold (minimum time difference) must be exceeded for a regression to be reported. This dual-gating approach prevents noise on sub-millisecond benchmarks from triggering false alarms.
Automated comparison in CI/CD -- Regression detection is integrated into continuous integration pipelines so that every pull request or commit is automatically checked for performance regressions before merging.
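The median-based, dual-threshold check described above can be sketched as follows. This is a minimal illustration rather than the project's actual implementation: the function name `is_regression`, the parameter names, and the default thresholds (10% relative, 0.01 seconds absolute, taken from the examples in the text) are assumptions.

```python
from statistics import median


def is_regression(old_timings, new_timings, rel_threshold=0.10, abs_threshold=0.01):
    """Return True only if the new median is slower by BOTH thresholds.

    Hypothetical helper: names and default values are illustrative.
    Timings are in seconds.
    """
    old_med = median(old_timings)
    new_med = median(new_timings)
    diff = new_med - old_med

    exceeds_abs = diff > abs_threshold            # absolute gate (seconds)
    exceeds_rel = diff > old_med * rel_threshold  # relative gate (fraction of baseline)
    return exceeds_abs and exceeds_rel


# A slow benchmark going from about 1.00 s to 1.20 s trips both gates.
slow_result = is_regression([1.00, 1.02, 0.99], [1.20, 1.21, 1.19])

# A fast benchmark going from about 1 ms to 2 ms is 100% slower in relative
# terms, but the 1 ms difference is below the absolute gate, so it is ignored.
fast_result = is_regression([0.001, 0.001, 0.0011], [0.002, 0.002, 0.0021])

print(slow_result, fast_result)
```

The second call shows why the absolute gate matters: without it, a one-millisecond fluctuation on a very fast benchmark would be reported as a doubling of runtime.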

Usage

Apply Performance Regression Detection in the following situations:

After running benchmarks on new code, to compare against a known baseline and determine if any benchmark has degraded.
As a gate in CI/CD pipelines to prevent merging code that causes performance regressions.
During development, to validate that optimization work has improved (or at least not degraded) performance.
When comparing performance across different branches, releases, or configurations.
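As a rough sketch of the CI gate use case, the snippet below compares two sets of results and produces a process exit code. The result layout (a dict mapping benchmark name to a list of timings in seconds), the benchmark names `q01` and `q02`, the sample numbers, and the threshold constants are all hypothetical; a real pipeline would load these from its own benchmark runner output.

```python
from statistics import median

REL_THRESHOLD = 0.10  # assumed: 10% slower than baseline
ABS_THRESHOLD = 0.01  # assumed: at least 0.01 seconds slower


def find_regressions(baseline, candidate):
    """Return (name, old_median, new_median) for every regressed benchmark."""
    regressed = []
    for name, old_timings in baseline.items():
        new_timings = candidate.get(name)
        if not new_timings:
            continue  # benchmark missing from the candidate run; skipped in this sketch
        old_med = median(old_timings)
        new_med = median(new_timings)
        diff = new_med - old_med
        if diff > ABS_THRESHOLD and diff > old_med * REL_THRESHOLD:
            regressed.append((name, old_med, new_med))
    return regressed


def gate(baseline, candidate):
    """Print each regression and return a CI-style exit code (0 = pass, 1 = fail)."""
    regressed = find_regressions(baseline, candidate)
    for name, old_med, new_med in regressed:
        print(f"REGRESSION {name}: {old_med:.3f}s -> {new_med:.3f}s")
    return 1 if regressed else 0


# Made-up sample data for illustration only.
baseline = {"q01": [0.50, 0.52, 0.51], "q02": [0.20, 0.21, 0.20]}
candidate = {"q01": [0.51, 0.50, 0.52], "q02": [0.30, 0.31, 0.29]}

exit_code = gate(baseline, candidate)
# In a CI job, this value would be passed to sys.exit() so the step fails on regression.
```

Returning a non-zero exit code is what lets a CI system block the merge without any further integration work.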

Theoretical Basis

This principle draws on several foundational concepts:

Statistical comparison of run distributions -- Performance measurements are inherently noisy. Comparing distributions rather than single values accounts for variance in execution time caused by system-level factors (CPU scheduling, memory allocation, thermal throttling).
Threshold-based regression detection -- Defining explicit, quantitative thresholds for what constitutes a regression converts a subjective judgment ("is this slower?") into an objective, automatable check. The dual-threshold approach (relative and absolute) reduces false positives because a change is flagged only when it is large both as a proportion of the baseline and in absolute time.
Robust statistics -- The median is a robust estimator of central tendency, meaning it is not unduly influenced by extreme values. This makes it preferable to the mean for benchmark timing data, which often exhibits right-skewed distributions with occasional outlier runs.
Continuous performance monitoring -- Benchmarks are run on every code change and compared against a baseline. This is similar to continuous testing, but focused on non-functional (performance) requirements.
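The robust-statistics point above can be illustrated with a small worked example. The timings are made up for illustration: four runs near 0.1 seconds plus one run disturbed by an outlier.

```python
from statistics import mean, median

# Hypothetical timings in seconds; the last run was hit by a system hiccup.
clean_runs = [0.100, 0.101, 0.099, 0.102]
noisy_runs = clean_runs + [0.500]

clean_mean, clean_median = mean(clean_runs), median(clean_runs)
noisy_mean, noisy_median = mean(noisy_runs), median(noisy_runs)

# Relative shift caused by adding the single outlier run.
mean_shift = (noisy_mean - clean_mean) / clean_mean
median_shift = (noisy_median - clean_median) / clean_median

print(f"mean:   {clean_mean:.4f} -> {noisy_mean:.4f} ({mean_shift:+.1%})")
print(f"median: {clean_median:.4f} -> {noisy_median:.4f} ({median_shift:+.1%})")
```

With these numbers, one outlier moves the mean by roughly 80% while the median moves by less than 1%, which is why a mean-based comparison could report a regression that the median-based comparison correctly ignores.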
