Principle:MaterializeInc Materialize Scalability Regression Testing
Purpose
Scalability Regression Testing is the practice of continuously validating that changes to the Materialize codebase do not degrade query throughput under concurrent load. By running identical SQL workloads against a baseline endpoint (typically a known-good release version) and a test endpoint (such as the current HEAD or a candidate build), the framework produces a quantitative comparison that either confirms acceptable performance or surfaces regressions that require investigation.
Core Principle
The fundamental idea is straightforward: if a given SQL workload achieves N transactions per second at concurrency level C on the baseline, then the same workload on the test endpoint should achieve at least N * (1 - threshold) transactions per second at the same concurrency level. Any result falling below this threshold is flagged as a regression. Any result exceeding N * (1 + threshold) is noted as a significant improvement.
This approach treats performance as a measurable, version-over-version property of the system, subject to the same rigor as functional correctness testing.
Methodology
Workload Selection
The framework defines a taxonomy of SQL workloads organized into four categories:
- DML & DQL workloads -- Pure data manipulation and query statements such as
INSERT,SELECT,UPDATE, and combinations thereof. These measure the steady-state throughput of the query processing pipeline. - DDL workloads -- Schema modification operations including
CREATE TABLE,CREATE MATERIALIZED VIEW, andDROPoperations. These measure the overhead of catalog and dataflow management. - Connection workloads -- End-to-end connection establishment and teardown, including plain, password-authenticated, and SASL-authenticated variants. These measure the latency of the authentication and session initialization path.
- Self-test workloads -- Calibration workloads (empty operations, controlled sleeps) used to validate the measurement infrastructure itself.
Each workload defines a list of operations that are executed repeatedly by concurrent worker threads. Operations are composable: they can be chained via OperationChainWithDataExchange, where the output data of one operation flows into the input of the next, enabling complex multi-step scenarios such as "create table, insert data, query, drop table" within a single logical transaction.
Concurrency Scaling
Benchmarks are executed across exponentially increasing concurrency levels. Starting from a minimum (often 1), the number of concurrent SQL connections grows by a configurable exponent base (e.g., 2.0), producing a series such as 1, 2, 4, 8, 16, 32. At each level, the number of total operations is scaled proportionally to the square root of the concurrency:
operation_count = floor(base_count * sqrt(concurrency))
This scaling ensures that:
- Lower concurrency levels run enough operations to produce stable measurements.
- Higher concurrency levels run proportionally more operations, compensating for the increased variability inherent in highly parallel execution.
A dedicated cursor pool is created for each concurrency level, providing each worker thread with its own pre-established database cursor. This avoids connection overhead contaminating the operation measurements.
Measurement
Each individual operation execution is timed using wall-clock measurement. The framework collects two levels of data:
- Detail-level records -- One record per operation execution, capturing the concurrency level, wall-clock duration, operation type, workload name, and transaction index within the worker.
- Summary-level records -- One record per (workload, concurrency) pair, aggregating total wall-clock time, total operation count, TPS (count / wallclock), and statistical summaries (mean, median, min, max) of individual transaction durations.
All measurements are persisted to CSV files organized by endpoint version and workload name, enabling offline analysis and historical comparison.
Comparison and Regression Detection
The comparison pipeline works as follows:
- Merge: The summary DataFrames from the baseline and test endpoints are joined on matching (concurrency, workload, count) tuples. This produces a merged frame with
TPS_BASELINEandTPS_OTHERcolumns. - Enrich: The merged frame is extended with absolute and percentage TPS differences.
- Threshold: Entries where the test endpoint TPS falls below the baseline TPS by more than the configured deviation threshold are classified as regressions. Entries exceeding the threshold in the positive direction are classified as significant improvements.
- Retry: If a regression is detected, the framework automatically retries the workload (up to 2 additional times by default) to filter out transient noise. Only persistent regressions survive the retry mechanism.
The deviation threshold is expressed as a decimal fraction (e.g., 0.10 for 10%). This is a configurable parameter of the DefaultResultAnalyzer.
Justified vs. Unjustified Regressions
Not all regressions represent bugs. Some are the expected consequence of intentional changes (e.g., a security hardening that adds overhead, or a correctness fix that removes an unsafe optimization). The framework distinguishes between:
- Unjustified regressions -- No known commit between the baseline version and the test version has been flagged as an accepted regression. These are reported as test failures.
- Justified regressions -- A known commit between the two versions has been explicitly recorded as introducing an accepted performance regression. These are logged but do not cause test failures.
This justification mechanism requires both endpoints to reference specific release versions. The lookup uses a curated list of ancestor overrides and accepted regression commits maintained in the repository.
Design Principles
Deterministic Schema Initialization
Before each workload execution at each concurrency level, the framework drops and recreates the benchmark schema from scratch. This ensures that:
- No state leaks between runs at different concurrency levels.
- Index creation and materialized view hydration are completed before measurement begins.
- Each run starts from an identical baseline state.
Endpoint Abstraction
The framework treats all benchmark targets -- local Materialize, remote Materialize, and PostgreSQL -- through a uniform Endpoint interface. This enables:
- Cross-system comparison: Running the same workloads against PostgreSQL to establish an external performance reference.
- Version-to-version comparison: Running the same workloads against two different Materialize releases.
- Local development comparison: Testing a local build against a released version.
Composable Operations
Operations follow a data-flow contract: each operation declares its required input keys and produced output keys. The framework validates these contracts at execution time. This design:
- Prevents silent failures from missing data.
- Enables complex multi-step workflows (e.g., create-populate-query-drop) to be assembled from reusable primitives.
- Makes it straightforward to add new operations without modifying the executor.
Observable Results
All results are captured in structured DataFrames and persisted to CSV. The plotting module generates scatter plots of TPS versus concurrency and distribution plots (violin or box) of individual transaction durations. These visualizations provide immediate insight into:
- How throughput scales with concurrency.
- Whether performance degradation appears at specific concurrency levels.
- The shape of the latency distribution (e.g., long tails indicating contention).
Integration into CI
Scalability regression testing is intended to run as part of the continuous integration pipeline. When a regression is detected and cannot be justified by a known accepted commit, the test produces a TestFailureDetails object that integrates with the standard Materialize test reporting infrastructure. This ensures that performance regressions receive the same visibility and accountability as functional test failures.
Summary
| Aspect | Approach |
|---|---|
| What is measured | Transactions per second (TPS) for categorized SQL workloads |
| How concurrency varies | Exponential scaling with configurable base, min, and max |
| How operations scale | Operation count grows with sqrt(concurrency)
|
| How regressions are detected | TPS percentage difference against a configurable threshold |
| How noise is filtered | Automatic retry (up to 2 retries) on detected regressions |
| How known regressions are handled | Justified via accepted-regression commit lookup between release versions |
| How results are reported | CSV persistence, matplotlib visualizations, CI test failure integration |