Principle:MaterializeInc Materialize Feature Performance Regression Detection
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance_Testing, Statistical_Analysis |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Feature performance regression detection is the practice of automatically identifying when a code change causes a measurable degradation in the performance of a specific database feature. Rather than relying on a single measurement, this principle employs a statistically rigorous pipeline: run repeated benchmark iterations under controlled termination conditions, filter outlier measurements, aggregate the remaining samples using a chosen statistical strategy, compare the aggregated result against a baseline version, and flag regressions when the degradation exceeds a configurable threshold.
Core Principles
Iterative Measurement with Statistical Termination
A single benchmark measurement is inherently noisy. This principle mandates running a scenario's measured workload repeatedly until a statistical termination condition is satisfied, rather than for a fixed number of iterations. Two primary termination strategies are used:
- Normal Distribution Overlap: Once more than 10 measurements have been collected, a normal distribution is fit to the data after each new iteration. When the overlap between two consecutive fits exceeds a threshold (e.g., 99%), adding further measurements no longer changes the distribution appreciably, so it is considered stable and measurement stops.
- Probability for Minimum: After collecting more than 5 measurements, the probability that a future measurement will be smaller than the current minimum is computed from the fitted normal distribution. When this probability drops below a threshold, the current minimum is considered reliable and measurement stops.
A hard upper bound (RunAtMost) serves as a safety net to prevent unbounded execution.
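The two termination conditions and the upper bound can be sketched with Python's standard-library `statistics.NormalDist`. This is a minimal illustration, not the framework's code: the function names, the `run_at_most` default, and the 0.05 threshold for the minimum check are assumptions made for the example.

```python
from statistics import NormalDist


def overlap_is_stable(samples, threshold=0.99, min_samples=10):
    """Stop when the fits with and without the newest sample overlap enough."""
    if len(samples) <= min_samples:
        return False
    previous = NormalDist.from_samples(samples[:-1])
    current = NormalDist.from_samples(samples)
    if previous.stdev == 0 or current.stdev == 0:
        # overlap() is undefined for zero spread; identical fits count as stable.
        return previous == current
    return previous.overlap(current) > threshold


def minimum_is_reliable(samples, threshold=0.05, min_samples=5):
    """Stop when a future sample is unlikely to undercut the current minimum."""
    if len(samples) <= min_samples:
        return False
    fit = NormalDist.from_samples(samples)
    if fit.stdev == 0:
        # No variance observed, so no smaller value is predicted by the fit.
        return True
    return fit.cdf(min(samples)) < threshold


def run_until_stable(measure, should_stop, run_at_most=50):
    """Collect samples until should_stop(samples) holds or the cap is reached."""
    samples = []
    while len(samples) < run_at_most:
        samples.append(measure())
        if should_stop(samples):
            break
    return samples
```

A caller would pass a zero-argument `measure` callable that runs one iteration and returns its measurement, for example `run_until_stable(measure, overlap_is_stable)`.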
Outlier Filtering
Before measurements are passed to the aggregation step, outlier filtering removes data points that would skew the result. The primary strategy discards any measurement exceeding one standard deviation above the running mean (after at least three data points have been collected). An alternative strategy discards the first measurement of each run to eliminate cold-start effects.
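Both filtering strategies can be sketched as follows. The function names are hypothetical, and two details are assumptions the text above does not settle: the running mean and standard deviation include the measurement being tested, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def filter_outliers(measurements, num_stdevs=1.0, min_points=3):
    """Drop values more than num_stdevs above the running mean."""
    seen, kept = [], []
    for value in measurements:
        seen.append(value)  # assumption: running statistics include this value
        if len(seen) >= min_points:
            limit = mean(seen) + num_stdevs * pstdev(seen)
            if value > limit:
                continue  # outlier: excluded from aggregation
        kept.append(value)
    return kept


def drop_first(measurements):
    """Discard the first measurement to remove cold-start effects."""
    return list(measurements[1:])
```

Only high values are discarded, so an unusually fast measurement always survives the filter.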
Aggregation Strategy
Filtered measurements are aggregated into a single representative value. The choice of aggregation strategy depends on the use case:
- Minimum: Returns the best (fastest) observed measurement, suitable when the goal is to measure peak performance.
- Mean: Returns the average, suitable for general-purpose comparison.
- Standard Deviation Adjusted: Returns the mean minus N standard deviations, providing a conservative estimate that accounts for variance.
- Normal Distribution: Fits a NormalDist to the data, preserving both location and spread for probabilistic comparison.
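The four strategies reduce to a few lines each. The sketch below uses Python's `statistics` module; the function names are illustrative, and the use of the population standard deviation in the adjusted variant is an assumption.

```python
from statistics import NormalDist, mean, pstdev


def aggregate_min(data):
    """Best (smallest) observed measurement."""
    return min(data)


def aggregate_mean(data):
    """Arithmetic mean of the filtered measurements."""
    return mean(data)


def aggregate_stddev_adjusted(data, num_stdevs=1.0):
    """Mean minus N standard deviations."""
    return mean(data) - num_stdevs * pstdev(data)


def aggregate_normal(data):
    """Fitted normal distribution, keeping both location and spread."""
    return NormalDist.from_samples(data)
```

The first three return a single number that can be compared by ratio, while the last returns a distribution object whose `mean` and `stdev` remain available for probabilistic comparison.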
Relative Threshold Regression Detection
Regression detection compares the aggregated measurement from the current version ("this") against a baseline version ("other") by computing their ratio. A regression is flagged when the ratio exceeds 1 + threshold, where the threshold is configurable per measurement type:
- Wallclock time: 10% threshold (flagged if more than 10% slower)
- Mz memory: 20% threshold (flagged if more than 20% higher memory usage)
- Clusterd memory: 50% threshold (flagged if more than 50% higher memory usage)
A strong regression uses double the standard threshold (e.g., 20% for wallclock), providing a two-tier severity classification.
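As a worked illustration of the ratio test, the sketch below classifies a pair of aggregated values using the thresholds listed above. The dictionary keys, function name, and return labels are hypothetical.

```python
# Relative thresholds per measurement type, taken from the list above.
THRESHOLDS = {
    "wallclock": 0.10,
    "mz_memory": 0.20,
    "clusterd_memory": 0.50,
}


def classify(this, other, measurement_type):
    """Compare 'this' against the baseline 'other' for one measurement type."""
    if other <= 0:
        raise ValueError("baseline measurement must be positive")
    threshold = THRESHOLDS[measurement_type]
    ratio = this / other
    if ratio > 1 + 2 * threshold:
        return "strong regression"  # more than double the standard threshold
    if ratio > 1 + threshold:
        return "regression"
    return "ok"
```

For example, a wallclock of 1.15 s against a 1.00 s baseline gives a ratio of 1.15, which exceeds 1.10 but not 1.20, so it is an ordinary regression rather than a strong one.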
Multi-Cycle Report Selection
To further reduce noise, the entire benchmark suite can be run across multiple cycles. Two selection strategies determine which cycle's report is canonical for each scenario:
- Median selection: Picks the report whose wallclock value is the median across all cycles, providing a robust central estimate.
- Best selection: Picks the report with the minimum wallclock value, preferring reports that show no regressions. This favors the best-case performance while avoiding false positives.
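The two selection strategies can be sketched for a single scenario as shown below. `CycleReport` and its fields are hypothetical stand-ins for whatever the framework records per cycle. Two choices here are assumptions: with an even number of cycles the lower of the two middle reports is taken, and "preferring reports that show no regressions" is read as ranking regression-free reports ahead of all others before comparing wallclock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleReport:
    cycle: int            # index of the cycle that produced this report
    wallclock: float      # aggregated wallclock value for the scenario
    has_regression: bool  # whether any measurement was flagged


def select_median(reports):
    """Pick the report whose wallclock is the median across cycles."""
    ordered = sorted(reports, key=lambda r: r.wallclock)
    return ordered[(len(ordered) - 1) // 2]


def select_best(reports):
    """Prefer regression-free reports, then the smallest wallclock."""
    return min(reports, key=lambda r: (r.has_regression, r.wallclock))
```

Under this reading, a fast cycle that nevertheless shows a regression is chosen only when every cycle shows one.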
Multi-Dimensional Measurement
Each benchmark scenario collects measurements along three dimensions simultaneously: wallclock execution time, materialized process memory, and clusterd process memory. This ensures that a code change that improves latency but significantly increases memory consumption is still flagged, and vice versa.
Scenario Lifecycle
The benchmark framework enforces a structured scenario lifecycle that ensures consistent, reproducible measurements:
- shared(): One-time global setup (e.g., creating shared infrastructure) that runs only for the first Mz instance.
- init(): Per-instance initialization (e.g., creating tables, loading data) that runs once per Mz under test.
- before(): Pre-measurement setup that runs before every measurement iteration (e.g., resetting state).
- benchmark(): The measured workload, which must produce exactly two timestamp markers (A and B). The wallclock measurement is the difference between these markers.
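The call order of the lifecycle can be illustrated with a small in-process harness. This is a hypothetical sketch and not the framework's API: the class names, the `run_scenario` driver, and the use of `time.perf_counter()` for markers A and B are all assumptions made to show the sequencing.

```python
import time


class Scenario:
    """Hypothetical base class exposing the four lifecycle hooks."""

    def shared(self):
        pass

    def init(self):
        pass

    def before(self):
        pass

    def benchmark(self):
        raise NotImplementedError


class RecordingScenario(Scenario):
    """Example scenario that logs each hook so the order is visible."""

    def __init__(self):
        self.calls = []

    def shared(self):
        self.calls.append("shared")

    def init(self):
        self.calls.append("init")

    def before(self):
        self.calls.append("before")

    def benchmark(self):
        self.calls.append("benchmark")
        marker_a = time.perf_counter()  # timestamp marker A
        sum(range(1000))                # stand-in for the measured workload
        marker_b = time.perf_counter()  # timestamp marker B
        return marker_a, marker_b


def run_scenario(scenario, instance_index, iterations):
    """Drive one instance through the lifecycle; return wallclock samples."""
    if instance_index == 0:
        scenario.shared()  # global setup, first instance only
    scenario.init()        # once per instance under test
    wallclocks = []
    for _ in range(iterations):
        scenario.before()  # reset state before every measurement
        marker_a, marker_b = scenario.benchmark()
        wallclocks.append(marker_b - marker_a)
    return wallclocks
```

In practice the fixed `iterations` count would be replaced by a statistical termination condition, with the iteration cap as a fallback.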
Rationale
Database performance is sensitive to many variables: OS scheduling, cache state, background processes, and load conditions. A single benchmark run is insufficient for reliable regression detection. By combining statistical termination, outlier filtering, configurable aggregation, and multi-cycle selection, this approach minimizes both false positives (flagging noise as a regression) and false negatives (missing a real regression hidden by noise). The configurable thresholds per measurement type acknowledge that different metrics have different inherent variance and different tolerance levels for degradation.