Principle:MaterializeInc Materialize Parallel Stress Testing
Principle
Parallel stress testing is the practice of exercising a database system under concurrent, multi-threaded load with randomized operations to expose concurrency bugs, race conditions, and performance regressions that sequential testing cannot reveal. In the Materialize codebase, this principle is realized through two complementary frameworks: the parallel workload framework (correctness-focused randomized testing) and the parallel benchmark framework (performance-focused regression detection).
Motivation
Database systems must handle simultaneous connections executing DDL, DML, and read operations against shared state. Bugs in this domain tend to be non-deterministic and difficult to reproduce because they depend on specific thread interleavings. A single-threaded test suite can verify functional correctness but cannot exercise the locking, coordination, and error recovery paths that activate under concurrent access.
Parallel stress testing addresses this gap by:
- Running multiple worker threads that simultaneously modify and query the database schema and data.
- Randomizing the sequence and combination of operations so that each run explores different interleavings.
- Using seed-based random number generators so that failures can be reproduced deterministically.
- Enforcing bounded resource limits to prevent tests from consuming unbounded memory or storage.
Core Tenets
Weighted Randomized Action Selection
Rather than scripting fixed sequences of operations, parallel stress tests select actions from weighted probability distributions. Each action class (SELECT, INSERT, CREATE TABLE, DROP VIEW, etc.) is assigned a relative weight within its category. Workers draw actions according to these weights, producing workloads that approximate realistic usage patterns while maintaining coverage of rare operations.
This approach ensures that:
- High-frequency operations (reads, inserts) dominate the workload, matching production behavior.
- Low-frequency but high-impact operations (schema changes, cluster swaps, privilege modifications) still execute regularly.
- The weight configuration can be tuned to focus testing on specific areas of concern.
Separation of Correctness and Performance Testing
The Materialize parallel testing strategy separates two distinct goals:
| Concern | Framework | Approach |
|---|---|---|
| Correctness | Parallel Workload | Randomized actions with expected-error suppression; any unexpected error is a test failure |
| Performance | Parallel Benchmark | Structured scenarios with QoS guarantees (minimum QPS, maximum p99 latency); regressions are flagged when thresholds are breached |
This separation allows each framework to optimize for its goal. The workload framework maximizes action diversity and error coverage, while the benchmark framework provides stable, repeatable measurements suitable for trend analysis.
Context-Sensitive Error Classification
Not all errors under concurrent load indicate bugs. When multiple threads race to create and drop objects, expected errors such as "unknown catalog item" or "query could not complete" naturally occur. The parallel stress testing principle requires that each action declare its expected errors based on the current operational context (complexity level, scenario mode).
This classification follows a hierarchy:
- Base errors -- Universally expected (permission denied, numeric overflow, division by zero).
- Complexity-dependent errors -- Expected only under DDL workloads (catalog item not found, concurrent drop, schema changes during transactions).
- Scenario-dependent errors -- Expected only during specific test modes (connection failures during Kill scenario, cancellation errors during Cancel scenario, read-only errors during ZeroDowntimeDeploy).
- Feature-dependent errors -- Expected when specific features are enabled (naughty identifier length errors, WebSocket connection errors).
An error that does not match any expected pattern causes the test to fail, ensuring that genuine bugs are not masked.
Thread-Safe State Management
Parallel stress tests maintain a shared model of the database state (tables, views, schemas, clusters, sources, sinks, roles) that all worker threads read and mutate. Thread safety is enforced through fine-grained locking: each database object has its own threading.Lock, and a global database lock coordinates structural changes.
This design avoids global serialization while preventing data corruption in the test harness itself. Workers can independently select random objects for their operations without holding the global lock, acquiring object-level locks only when mutating state.
Bounded Resource Growth
Stress tests that create objects without limit will eventually exhaust system resources, producing failures that are artifacts of the test rather than real bugs. The parallel stress testing principle enforces maximum counts for every resource type (tables, views, indexes, clusters, sources, sinks, roles, schemas, databases). CREATE actions check the current count before proceeding, and DROP actions keep the population within bounds.
Reproducibility Through Seeded Randomness
Every random decision in a parallel stress test is driven by a seeded random.Random instance. The seed is logged at the start of each run, allowing any failure to be reproduced by re-running with the same seed, thread count, runtime, complexity, and scenario parameters.
# Example invocation parameters for reproducibility
--seed=42 --threads=10 --runtime=300 --complexity=ddl --scenario=regression
Open-Loop and Closed-Loop Load Generation
Performance benchmarks use two complementary load-generation strategies:
- Open-loop: Actions are submitted at a fixed rate regardless of completion, modeling external load that does not back off. This reveals how the system behaves under sustained pressure.
- Closed-loop: The next action starts only after the previous one completes, modeling clients that wait for responses. This measures per-query latency without queuing effects.
Combining both strategies in a single scenario provides a comprehensive view of system behavior under realistic mixed workloads.
Quality-of-Service Guarantees
Performance benchmarks define explicit guarantees for key queries: minimum queries per second (QPS) and maximum p99 latency in milliseconds. These thresholds serve as automated regression gates. When a benchmark run fails to meet its guarantees, the test flags a performance regression that requires investigation.
| Metric | Purpose | Example |
|---|---|---|
| QPS | Minimum throughput under load | "qps": 15 -- at least 15 queries per second
|
| p99 | Maximum 99th-percentile latency | "p99": 400 -- 99% of queries complete within 400ms
|
Operational Scenarios
The stress testing principle encompasses multiple operational scenarios that exercise different system behaviors:
| Scenario | What It Tests |
|---|---|
| Regression | Standard concurrent operations for general correctness |
| Cancel | Query cancellation under concurrent load; verifies that cancelled queries do not corrupt state |
| Kill | Process termination and restart; verifies that the system recovers gracefully and no data is lost |
| Rename | Object and schema renaming under concurrent access; verifies referential integrity during renames |
| BackupRestore | Backup and restore cycles during active workload; verifies data consistency post-restore |
| ZeroDowntimeDeploy | Rolling deployment while queries are in flight; verifies that the transition is transparent to clients |
Relationship to Production Reliability
Parallel stress testing directly maps to production reliability requirements:
- Multi-tenant workloads are simulated by running DDL, DML, and read operations from different workers against shared schemas.
- Source and sink reliability is tested by concurrently creating, dropping, and ingesting through Kafka, Postgres, MySQL, and webhook connectors.
- Cluster management is exercised by creating, dropping, swapping, and resizing clusters while queries are running.
- RBAC correctness is verified by granting and revoking privileges while other threads attempt operations that depend on those privileges.
The combination of randomized correctness testing and structured performance benchmarking provides confidence that Materialize handles concurrent workloads correctly and within acceptable performance bounds.